Figure 2 and supplementary figure 2

Author

Niccolò Arecco & Ivano Mocavini

Last code execution: 2023 September 08, Friday @ 16:09:54.

1 Intro

Using the proteomicslfq nextflow (Di Tommaso et al. 2017) pipeline I processed the thermo raw files I received from the CRG/UPF proteomics facility and calculate the peptide spectral matches (PSMs) against a protein database I build from Mouse UniProt (version 2022_03 3th Aug 2022) using both canonical and alternative isoforms as well as common contaminants.

The pipeline was run as following on the CRG cluster.

Code
nextflow run main.nf -profile singularity -with-tower -c crg.config -bg \
                   --input "../data/raw/SUZ12IP_DExon4_WT/*.raw" \
                       --database ../data/reference/proteome/Mouse_prot_database.fasta \
                   --expdesign ../data/exp_design/Exon4_vs_WT_ESCs.tsv \
                   --outdir Suz12_Dex4_WT --allowed_missed_cleavages 3 \
                   --search_engines comet,msgf --enzyme Trypsin --num_hits 3 \
                   --add_decoys --skip_post_msstats \
                   --protein_quant 'shared_peptides' \
                   --protein_inference 'bayesian' \
                   --quantification_method 'spectral_counting'
                   
nextflow run main.nf -profile singularity -with-tower -c crg.config -bg \
                   --input "../data/raw/SUZ12_FLAGIP_SKO_Rescues/SKO*.raw" \
                       --database ../data/reference/proteome/Mouse_prot_database.fasta \
                   --expdesign ../data/exp_design/Suz12KO_vs_Rescues_ESCs.tsv \
                   --outdir Suz12_Flag_Rescues --allowed_missed_cleavages 3 \
                   --search_engines comet,msgf --enzyme Trypsin --num_hits 3 \
                   --add_decoys --skip_post_msstats \
                   --protein_quant 'shared_peptides' \
                   --protein_inference 'bayesian' \
                   --quantification_method 'spectral_counting'

The parameter --search_engines comet,msgf indicates that the proteins are searched using MS-GF+ (Kim and Pevzner 2014) and Comet (Eng, Jahan, and Hoopmann 2013) engines.

The option --protein_quant 'shared_peptides' indicates to quantify proteins based on peptides mapping to single and multiple proteins too, but only mapping peptides greedily for its best group (by inference score). This inference is controlled by --protein_inference 'bayesian' that indicates in order to group proteins, it calculate scores on the protein (group) level and to potentially modify associations from peptides to proteins using bayesian statistics.

The --quantification_method 'spectral_counting' indicated that spectral counting of PSMs is used to quantify the protein intensities.

First I copy the output quantification pipeline out.mzTab files to a local folder.

Code
mkdir -p ~/OneDrive\ -\ CRG\ -\ Centre\ de\ Regulacio\ Genomica/Suz12_AS_project/_Code/Fig2/tables

cp ~/mnt/narecco/projects/07_Suz12AS/proteomicslfq/Suz12_Flag_Rescues/proteomics_lfq/out.mzTab ~/OneDrive\ -\ CRG\ -\ Centre\ de\ Regulacio\ Genomica/Suz12_AS_project/_Code/Fig2/tables/Suz12_Flag_Rescues.mzTab

cp ~/mnt/narecco/projects/07_Suz12AS/proteomicslfq/Suz12_Dex4_WT/proteomics_lfq/out.mzTab ~/OneDrive\ -\ CRG\ -\ Centre\ de\ Regulacio\ Genomica/Suz12_AS_project/_Code/Fig2/tables/Suz12_Dex4_WT.mzTab

Now I process the PSMs using DEP R package (Zhang et al. 2018) to perform differential enriched proteins analysis.

2 Set Up

2.1 Packages

Load packages required for the analysis and suppress any message. Check the Section 5 section at the end for package versions.

Code
library(readr)
library(readxl)
library(tibble)
library(tidyr)
library(stringr)
library(dplyr, warn.conflicts = F, quietly = T)
library(ggplot2)
library(ggrepel)
library(ggsignif)
library(viridis, quietly = T)
library(patchwork)
library(plotly)
library(SummarizedExperiment)
library(Biostrings)
library(MSnbase)
library(UniprotR)
library(DEP)
library(knitr)
library(UpSetR)

options(dplyr.summarise.inform = F)

2.2 Functions

Define a plot style.

Code
# Define the ggplot2 themes used for the figures.
theme_classic(base_family = "Arial", base_size = 6) +
  theme(axis.line = element_line(linewidth = 0.15, colour = "black"),
        axis.text = element_text(colour = "black", size = 6),
        axis.text.x = element_text(margin = margin(r = -2, unit = "mm")),
        axis.title = element_text(colour = "black", size = 6),
        axis.title.x = element_blank(),
        axis.title.y = element_text(vjust = 1, 
                                    margin = margin(r = 0, unit = 'mm')),
        axis.ticks = element_line(colour = "black"),
        axis.ticks.x = element_blank(),
        axis.ticks.length.y = unit(1, units = "mm"),
        strip.text.x = element_text(vjust = -1.5, size = 6),
        strip.background = element_blank(),
        legend.position = c(0.95, 0.85),
        legend.title = element_blank(),
        legend.text = element_text(colour = "black", size = 6),
        legend.background = element_blank(),
        legend.key.height = unit(x = 1, units = "mm"),
        legend.key.width = unit(x = 1, units = "mm"),
        panel.grid.major.y = element_line(colour = 'grey73', linewidth = 0.1),
        panel.background = element_blank(),
        plot.background = element_blank()) -> MSBar_plot_theme

theme_classic(base_size = 6, base_family = "Arial") +
    theme(axis.line = element_line(linewidth = 0.15, colour = "black"),
          axis.text = element_text(colour = "black", size = 6),
          axis.text.x = element_text(margin = margin(r = -2, unit = "mm")),
          axis.title = element_text(colour = "black", size = 6),
          axis.title.y = element_text(vjust = 1, 
                                      margin = margin(r = 0, unit = 'mm')),
          legend.position = "bottom",
          legend.title = element_text(vjust = 1),
          legend.background = element_blank(),
          panel.grid.major = element_line(colour = 'grey73', linewidth = 0.1),
          panel.background = element_blank(),
          plot.background = element_blank()) -> pep_coverage_theme

theme_classic(base_family = "Arial", base_size = 6) +
  theme(legend.position = "none",
        axis.line = element_line(linewidth = 0.15, colour = "black"),
        axis.text = element_text(colour = "black", size = 6),
        axis.title = element_text(colour = "black", size = 6),
        axis.title.y = element_text(vjust = 1, 
                                    margin = margin(r = 0, unit = 'mm')),
        axis.ticks = element_line(colour = "black"),
        axis.ticks.length = unit(1, units = "mm"),
        strip.background = element_blank(),
        panel.grid.major = element_line(colour = 'grey73', linewidth = 0.1),
        plot.background = element_blank()) -> Volcano_plot_theme 

Parse PSM peptide signal in an MzTab file

Code
read_psm_MzTab <- function(MzTab_path, POI_UniProtID, parse_accession = T) {
  
  
  cols_to_keep <- c('sequence', 'PSM_ID', 'accession', 'modifications', 
                    'retention_time', 'charge', 'exp_mass_to_charge',
                    'calc_mass_to_charge','pre', 'post', 'start', 'end',
                    'opt_global_Posterior_Error_Probability_score',
                    'opt_global_q-value')
  psm_raw <- readMzTabData(file = MzTab_path, what = "PSM", version = "1.0", verbose = T)
  
  fData(psm_raw) |> 
    as_tibble() |>
    mutate(is_target = grepl(pattern = POI_UniProtID, x = accession) ) |>
    subset(is_target) |>
    # use only uniquely mapping peptides
    subset(unique == 1) |>
    select(all_of(cols_to_keep)) -> psm
    
  # remove 'opt_global_' on the fly from the column names
  colnames(psm) <- gsub('opt_global_', '', colnames(psm) )
  colnames(psm) <- gsub('\\-', '_', colnames(psm) )
  
  psm |>
    mutate(across( c(start, end), as.integer )) |>
    arrange(start, end) -> psm
  
  if (parse_accession == TRUE) {
    
    psm |> 
      mutate(SwissProt = str_split_fixed(string = accession, pattern = '\\|', n = 3)[,1],
             UniProtID = str_split_fixed(string = accession, pattern = '\\|', n = 3)[,2],
             Prot_Species = str_split_fixed(string = accession, pattern = '\\|', n = 3)[,3],
             .before = accession) |>
      mutate(Protein = str_remove(string = Prot_Species, pattern = '_MOUSE'),
             .before = accession) |>
      select(!c(SwissProt, Prot_Species, accession)) -> psm_tidy
    
    if ( length(unique(psm_tidy$UniProtID)) == 1 ) {
      if ( unique(psm_tidy$UniProtID) != POI_UniProtID ) {
        stop('Something is wrong with the UniProtID parsed from the accession')
      } 
    } else if( length(unique(psm_tidy$UniProtID)) >= 2) {
      message('Parsed more than one UniProtID, double check the results!')
    } else {
      stop('Not sure how many UniProtID where parsed...' )
    }
    
    
  } else{
    psm_tidy <- psm
  }
  
  # remove start and end NA
  psm_tidy <- subset(psm_tidy, !is.na(start))
  psm_tidy <- subset(psm_tidy, !is.na(end))
  
  return(psm_tidy)
}

Read Saint output data and tidy the data

Code
read_saint_peps <- function(saint_path, UniProtID, AA_exon_start, AA_exon_end, eoi_name = 'exonXYZ') {
  peptides <- read_excel(path = saint_path, sheet = "Peptides", trim_ws = TRUE)
  colnames(peptides) <- gsub("# ", "Num_", colnames(peptides))
  colnames(peptides) <- gsub("\\+ ", "plus_", colnames(peptides))
  colnames(peptides) <- gsub(": ", "_", colnames(peptides))
  colnames(peptides) <- gsub(" ", "_", colnames(peptides))
  colnames(peptides) <- gsub("Exp_Value_", "ExpValue_", colnames(peptides))
  # rename _Area column
  rename_area_col <- grep(pattern = "2_Area", x = colnames(peptides))
  colnames(peptides)[rename_area_col] <- paste0("Area_", paste0(LETTERS[1:length(rename_area_col)], 2) )
  # rename "score" column
  rename_score_col <- grep(pattern = "^[A-F]4$", x = colnames(peptides))
  colnames(peptides)[rename_score_col] <- paste0("Score_", colnames(peptides)[rename_score_col] )
  # Remove number
  colnames(peptides) <- gsub(pattern = "[2|4]$", "", colnames(peptides))

  # Map peptides to proteins
  Prot_seq <-  UniprotR::GetSequences(ProteinAccList = UniProtID)$Sequence
  Prot_seq <- AAString(x = Prot_seq)
  Prot_aa <- nchar(Prot_seq)
  
  # extract rows with Suz12 peptides
  get_indx <- grep(pattern = UniProtID, x = peptides$Protein_Group_Accessions)
  
  Prot_pep <- peptides[get_indx, ]
  pep_seq <- Prot_pep$Sequence
  names(pep_seq) <- Prot_pep$Sequence
  Prot_pep_seq <- AAStringSet(x = Prot_pep$Sequence, use.names = T)
  
  # Extract peptides coordinates on protein
  # Which peptide gets mapped to Suz12 seq
  whichPDict(pdict = Prot_pep_seq, 
             subject = Prot_seq, 
             max.mismatch = 0,
             with.indels = F) -> peptides_mapped
  
  message(length(Prot_pep_seq), " peptides could map\n",
          length(peptides_mapped), " actually map precisely")
  
  # Extract peptides interval in protein
  matchPDict(pdict = Prot_pep_seq[peptides_mapped], 
             subject = Prot_seq, 
             max.mismatch = 0,
  ) |> unlist() |> as.data.frame() |>
    # Bind peptide coordinates to the sub table with all the info for each peptide
    cbind(Prot_pep[peptides_mapped, ]) -> pep_cov
  
  # Add exon info
  pep_cov <- mutate(
    pep_cov,
    Exon = case_when(
      between(start, AA_exon_start, AA_exon_end) ~ eoi_name,
      between(end, AA_exon_start, AA_exon_end) ~ eoi_name,
      !between(start, AA_exon_start, AA_exon_end) ~ "other_exons",
      !between(end, AA_exon_start, AA_exon_end) ~ "other_exons"
    )
  )
  
  return(pep_cov)
}

Parse and map peptides positions

Code
cov2res_peptides <- function(pep_cov) {
  pep_res <- pep_cov |>
    dplyr::select(-starts_with("Score")) |>
    pivot_longer(cols = starts_with(c("Area", "ExpValue", "IonScore")),
                 names_sep = "_",
                 names_to = c("Parameter", "Sample"), 
                 values_to = "Value")
  
  pep_score <- pep_cov |>
    dplyr::select(1:10, starts_with( c("Score", "Exon"))) |>
    unique() |>
    pivot_longer(cols = starts_with("Score"),
                 names_to = "Sample",
                 names_prefix = "Score_",
                 values_to = "Score",
    ) |>
    select(12, 13, 1:3, 11, 4:10) |>
    arrange(start)
  
  pep_res <- left_join(pep_res, pep_score, 
                       by = join_by("start", "end", "width", "Sequence", "Num_PSMs",
                                    "Num_Proteins", "Num_Protein_Groups",
                                    "Protein_Group_Accessions", "Modifications",
                                    "MHplus_[Da]", "Exon", "Sample")) |>
    arrange(start) |>
    relocate(Sample, .before = start)
  # Factorize peptides sequences by their position in the protein
  pep_res$Pep_Sequence <- pep_res$Sequence
  pep_res$Pep_Sequence <- factor(pep_res$Pep_Sequence, 
                                 levels = unique(rev(pep_res[order(pep_res$start),]$Sequence) ) )
  return(pep_res)
}

Little helper function to plot peptide counts in exon of interest

Code
summary_exon_coverage <- function(pep_cov_df) {
  pep_cov_df |>
    select(Sequence, Num_PSMs, Exon) |>
    unique() |>
    group_by(Exon) |>
    summarise( PSMs = sum(Num_PSMs), .groups = "drop") -> pep_cov_df
    
  sum_psms <- sum(pep_cov_df$PSMs)
  add_row(pep_cov_df, Exon = "total_protein", PSMs = sum_psms) |>
    setNames(c("Region", "Pepetide_Spectrum_Matches"))|>
    as_tibble() -> summary_cov
  
  return(summary_cov)
}

Plotting function for peptides coverage and exon of interest highlight.

Code
plot_pep_cov <- function(df_res, AA_exon_start, AA_exon_end, Param_plot = 'Area') {
  
  subset(df_res, Parameter == Param_plot) |>
    subset(!is.na(Value)) |>
    subset(Value > 0 ) |>
    ggplot()  + 
    aes(x = start, xend = end, y = Sample_Name, yend = Sample_Name, color = Value) +
    geom_segment(size = 1.5) +
    geom_vline(xintercept = AA_exon_start, size = 0.2) +
    geom_vline(xintercept = AA_exon_end, size = 0.2) +
    labs(x = "Amino acids position", y = "") +
    scale_x_continuous(expand = expansion(add = 0.1), 
                       n.breaks = 10, breaks = waiver(), 
                       limits = c(0, max(df_res$end)+1)) +
    scale_colour_viridis(name = paste0("Peptide ", Param_plot),
                         trans = "log10",
                         breaks = c(1e7, 3e7, 1e8, 3e8, 1e9, 3e9, 1e10, 3e10) )  +
    guides(colour = guide_colourbar(barwidth = grid::unit(12, "cm"), 
                                    barheight = grid::unit(2, "mm") ) ) +
    pep_coverage_theme -> p_coverage
  
    return(p_coverage)
}

Helper functions for DEP analysis

Code
# Function to extract number of DE proteins
DE_prots <- function(results) {
  tibble(Dataset = gsub("_results", "", results),
         Num_Signif_Proteins = get(results) |> filter(significant) |> nrow(),
         Proteins_Names = get(results) |> filter(significant) |> pull(name) 
         |> paste0(collapse = " / ")
         )
}

# Function that wraps around test_diff, add_rejections and get_results functions
DE_analysis <- function(se, signif_thrshld = 0.05) {
    data_diff <- test_diff(se, type = "manual", test = 'WT_vs_Dexon4')
    res <- get_results(add_rejections(data_diff, alpha = signif_thrshld, lfc = 1.25))
    return(res)
}

# Function to obtain ROC data
get_ROC_df <- function(results) {
  get(results) |>
  select(name, WT_vs_Dexon4_p.val, significant) |>
  mutate(
    DE = grepl(T, significant),
    BG = grepl(F, significant)) |>
  arrange(WT_vs_Dexon4_p.val) -> tmp
  
  mutate(tmp,
      TPR = cumsum(as.numeric(DE)) / length(which(tmp$DE)),
      FPR = cumsum(as.numeric(BG)) / length(which(tmp$BG)),
      method = results) -> tmp
  return(tmp)
}

# Function that wraps around test_diff, add_rejections and get_results functions
DE_analysis_KOrL_KOrS <- function(se, signif_thrshld = 0.05) {
    data_diff <- test_diff(se, type = "manual", test = 'KOrL_vs_KOrS')
    res <- get_results(add_rejections(data_diff, alpha = signif_thrshld, lfc = 1.5))
    return(res)
}

2.3 Directories & File Paths

Here I organise all the variables I need to run the analysis and define where to save the processed tables and figures.

Code
oneDrive_Dir <- file.path("~/OneDrive - CRG - Centre de Regulacio Genomica/Suz12_AS_project")  
code_dir_fig2 <- file.path(oneDrive_Dir, "_Code/Fig2")

tbl_dir_fig2 <- file.path(code_dir_fig2, "tables")
pdf_dir_fig2 <- file.path(code_dir_fig2, "pdfs")

if (!dir.exists(pdf_dir_fig2)) { dir.create(pdf_dir_fig2, recursive = T) }

Set path to the skyline output of the targeted proteomics experiment

Code
PRM_skyline_output_path <- file.path(tbl_dir_fig2, "PRM_Target_Proteomics_SkylineOutput.xlsx")

Set paths of IP-MS mzTab files.

Code
WT_Dex4_filename <- "Suz12_Dex4_WT"
Resce_Suz12KO_filename <- "Suz12_Flag_Rescues"

WT_Dex4_mzTAB_path <- list.files(tbl_dir_fig2, 
                                 pattern = paste0(WT_Dex4_filename, ".mzTab"),
                                 full.names = T)

Resce_KO_mzTAB_path <- list.files(tbl_dir_fig2, 
                                  pattern = paste0(Resce_Suz12KO_filename, ".mzTab"),
                                  full.names = T)

stopifnot(file.exists(WT_Dex4_mzTAB_path))
stopifnot(file.exists(Resce_KO_mzTAB_path))

Import metadata for these experiments.

Code
WT_Dex4_Metada_path <- list.files(tbl_dir_fig2, 
                                   pattern = paste0(WT_Dex4_filename,
                                                    "_Experimental_Design_Metadata.xlsx"),
                                   full.names = T)

Resce_KO_Metada_path <- list.files(tbl_dir_fig2, 
                                   pattern = paste0(Resce_Suz12KO_filename,
                                                    "_Experimental_Design_Metadata.xlsx"),
                                   full.names = T)

Saint output file paths

Code
saint_path_WT_dEx4 <- file.path(tbl_dir_fig2, "Suz12_IP-MS_mESCs_WT_vs_dExon4_Saint.xlsx")
saint_path_KO_Rescues <- file.path(tbl_dir_fig2, "Suz12_IP-MS_mESCs_Suz12KO_vs_Rescues_Saint.xlsx")
stopifnot(file.exists(saint_path_WT_dEx4))
stopifnot(file.exists(saint_path_KO_Rescues))

3 Main Figure Panels

3.1 Targeted proteomic peptide abundance

Define the peptides of specific to SUZ12-L, SUZ12-S and a well abundant common peptide.

Code
inclusio_peptide <- "TFKVDDMLSKVEK"
skipping_peptide <- "SLSAHLQLTFTGFFHK"
common_peptide <- "NLIAPIFLHR"

Import common peptide signal

Code
peptides <- c(common_peptide, inclusio_peptide, skipping_peptide)
Common_Precursor <- read_excel(path = PRM_skyline_output_path, 
                               sheet = "Common_Peptide_Q80U70_E9PW18",
                                ) |>
  pivot_longer(cols = !c("Precursor_Ion"), names_sep = "_", 
               names_to = c("Sample_Type", "Replicate"), values_to = "Area_Precursor") |>
  subset(Replicate %in% c(2, 4))  |>
  mutate(Sample_Type = ifelse(grepl(Replicate, pattern =  2), yes = "WT", no = "∆ex4"))  |>
  select(c(Precursor_Ion, Sample_Type, Area_Precursor))

Import individual fragments areas extracted with Skyline (MacLean et al. 2010).

Code
PRM <- read_excel(path = PRM_skyline_output_path, sheet = "Skyline_Output", 
                  .name_repair = function(col){ gsub(" ", "_", col) } ) # change spaces to underscore

Calculate the fragments areas sum.

Code
PRM |>
  subset(Replicate %in% c(2, 4)) |>
  subset(Peptide %in% c(inclusio_peptide, skipping_peptide) ) |>
  subset(Precursor_Charge == 3)  |>
  subset(grepl("^y|b", x = Fragment_Ion)) |>
  subset(Fragment_Ion != "y14") |>
  mutate(Peak_Rank = Peak_Rank - 1) |>
  mutate(Sample_Type = ifelse(grepl(Replicate, pattern =  2), yes = "WT", no = "∆ex4")) |>
  mutate(Sample_Type = factor(Sample_Type, levels = c("WT", "∆ex4")) ) |>
  group_by(Peptide, Replicate, Sample_Type) |>
  select(c(Peptide, Replicate, Precursor_Charge, Fragment_Ion, Area, Peak_Rank, Sample_Type) ) |>
  dplyr::summarise(Area_Sum = sum(Area)) |>
  relocate(Sample_Type, .before = Peptide) -> framents_area_sum

Normalise the area sums to the area of the common SUZ12 isoforms precurson

Code
left_join(framents_area_sum, Common_Precursor, by = "Sample_Type" ) |>
  mutate(Norm_Area = Area_Sum / (Area_Precursor/10^3)) |>
  mutate(Peptide = factor(Peptide, levels = c(inclusio_peptide, skipping_peptide))) |>
  mutate(Sample_Type = factor(Sample_Type, levels = c("WT", "∆ex4"))) -> PRM_areas

Plot Figure 2D.

Code
ggplot(PRM_areas)+
  aes(x = Sample_Type, y = Norm_Area, fill = Peptide) +
  facet_wrap(~Peptide, scale = "free_y") +
  geom_col(show.legend = F, lwd = 0.1, colour = 'black', width = 0.75) +
  scale_fill_manual(values = c("royalblue3", "goldenrod")) +
  labs(y = "fragments / common peptide ") +
  coord_cartesian(clip = 'off') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.02), add = 0),
                     n.breaks = 5) +
  theme_classic(base_size = 6, base_family = "Arial") +
  theme(panel.grid.major = element_line(linewidth = 0.15, colour = "gray84"),
        axis.text = element_text(colour = "black"),
        axis.title = element_text(size = 5),
        axis.title.x = element_blank(),
        axis.line = element_line(linewidth =  0.1),
        plot.title = element_text(size = 5, hjust = 0.5, margin = margin(b = -1, unit = "mm")),
        axis.ticks = element_line(linewidth = 0.1),
        axis.ticks.x = element_blank(),
        panel.background = element_blank(),
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.background.x = element_blank(),
        strip.text = element_text(family = "Courier", size = 4),
        strip.background = element_blank()) -> p_peptides_areas
p_peptides_areas

Save bar plot to pdf.

Code
ggsave(filename = "Fig2D_Norm_Peptides_Area_Barplot.pdf", plot = p_peptides_areas,
       device = cairo_pdf, path = pdf_dir_fig2,  units = "cm",
       width = 3.5, height = 2.8)

3.2 WT and ∆exon4 Suz12 IP

Here I analyse 2 datasets. First a Suz12 IP in mESCs of WT and ∆exon4 cell lines, then a Flag IP of Suz12 KO and Suz12 KO rescued with either long or short isoforms, also in mESCs.

3.2.1 Import PSMs from mzTab file

From the proteomicslfq pipeline the important mzTAB file and do some tidying up of the data.

Code
prot_WT_Dex4 <- readMzTabData(file = WT_Dex4_mzTAB_path, what = "PRT", version = "1.0")

exprs(prot_WT_Dex4) |>
  as.data.frame() |>
  rownames_to_column(var = "info") |>
  setNames(c("info", "WTp", "WT#1", "WT#2", "dEX4#1", "dEX4#2", "dEX4#3") ) |>
  tidyr::drop_na() |> 
  mutate(swissprot = str_extract(pattern = "^[s-t][p|r]", string = info)) |>
  # extract the UniProt ID while preserving alternative isoform ID (dashed number)
  mutate(UniProtID = str_extract(pattern = "(?<=^[s-t][p|r].)[A0-Z9](.*?)\\.([1-9]?)", string = info)) |> 
  mutate(UniProtID = sub(pattern = "\\.", replacement = "-", x = UniProtID)) |>
  mutate(UniProtID = sub(pattern = "-$", replacement = "", x = UniProtID)) |>
  # Extract species name
  mutate(Species = str_extract(pattern = "_[A0-Z9](.*?)(?=\\.1)", string = info)) |>
  mutate(Species = sub(pattern = "^_", replacement = "", x = Species)) |>
  # Not really the protein name, it's an abbreviation used by uniprot
  mutate(ProteinName = str_extract(pattern = "(?<=^[s-t][p|r]\\.[A0-Z9].....).*?\\.[A0-Z9](.*?)(?=_)",
                                   string = info)) |>
  mutate(ProteinName = str_remove(string = ProteinName, pattern = "(.*?)\\.")) |>
  mutate(ProteinName = str_remove(string = ProteinName, pattern = "([1-9]?)\\.")) |>
  mutate(ProteinName = gsub(pattern = "\\.", "-", x = ProteinName )) |>
  mutate(ProteinName = str_to_title(string = ProteinName) ) -> tidy_prot_wt_dEx4

Here I manually fix 2 protein names. First Jarid2 is called Jard2, this is because it comes from JARD2_MOUSE name used by UniProt. The second is the protein called EED which in this experiment is mapped to the not so well annotated protein with ID A0A5F8MPX8. After some digging into the mzTAB file I believe this is happening because the proteomics engines map the peptide RLGAICDSGGGGGGGGAGSFAAGSGR to A0A5F8MPX8 which cannot be mapped to EED as this seems to be in a longer N-terminus stretch compared to the canonical EED.

Code
tidy_prot_wt_dEx4 |>
  # Fix Jarid2 name that comes from JARD2_MOUSE
  mutate(ProteinName = sub(pattern = "Jard2", replacement = "Jarid2", x = ProteinName)) |>
  mutate(ProteinName = ifelse(test = UniProtID == "A0A5F8MPX8", yes = "Eed", no = ProteinName)) |>
  relocate(swissprot, UniProtID, Species, ProteinName, .after = info) -> tidy_prot_wt_dEx4

Other mis-labelled protein is for example A0A3B2WBM3 which actually corresponds to Elongin-B. Probably the mis-labelling is for the same reason, so I fix it like in the case of Eed adding the PSMs to the other Elob protein with UniProtID P62869.

Code
tidy_prot_wt_dEx4 |>
  # Fix Elongin-B
  mutate(ProteinName = ifelse(test = UniProtID == "A0A3B2WBM3", yes = "Elob", no = ProteinName)) |>
  subset(ProteinName == "Elob") |>
  mutate(across(.cols = c("WTp", "WT#1", "WT#2", "dEX4#1", "dEX4#2", "dEX4#3"), .fns = sum )) |>
  # Of the 2 IDs keep only the sp one
  subset(swissprot == "sp") -> tidy_Elob_sum

# remove the Elob individual quantification and add the summed values
tidy_prot_wt_dEx4 |>
  subset(ProteinName != "Elob") |>
  rbind(tidy_Elob_sum) -> tidy_prot_wt_dEx4

3.2.2 Calculate a stoichiometry score for PRC2 proteins.

Select PRC2 components.

Code
core <- c("Suz12", "Eed", "Ezh2", "Rbbp7", "Rbbp4", "Ezh1" )
PRC2.1 <- c("Mtf2", "Epop", "Phf19", "Elob", "Eloc") # Tceb = Elongin b/c
PRC2.2 <- c("Jarid2", "Aebp2") # "Ezhip"
PRC2 <- c(core, PRC2.1, PRC2.2)

Subset for these proteins.

Code
PRC2_prot_wt_dEx4 <- subset(tidy_prot_wt_dEx4, ProteinName %in% PRC2)

Use the UniprotR package to query UniProt and get protein length. This could be a bit time consuming.

Code
PRC2_prot_wt_dEx4$Protein_Length <- UniprotR::GetSequences(ProteinAccList = PRC2_prot_wt_dEx4$UniProtID)$Length

Reshape the data.

Code
PRC2_prot_wt_dEx4 |>
  pivot_longer(cols = c(starts_with("WT"), starts_with("dEX4") ),
               names_to = "Sample", values_to = "Spec") |>
  mutate(Condition = case_when( grepl(x = Sample, pattern = "WT") ~ "WT",
                                grepl(x = Sample, pattern = "dEX4") ~ "∆ex4") ) |>
  relocate(Condition, .after = Sample) |>
  mutate(Condition = factor(Condition, levels = c("WT", "∆ex4") )) |>
  mutate(Spec = as.integer(Spec),
         Norm_Spec = Spec / Protein_Length) -> spec_counts

Use Suz12 has a normalising factor.

Code
Su12_spec <- subset(spec_counts, ProteinName == "Suz12")
colnames(Su12_spec)[colnames(Su12_spec) == "Norm_Spec"] <- "Bait_Norm_Spec"
Su12_spec |> 
  group_by(Condition) |>
  mutate(Mean_Spec = mean(Bait_Norm_Spec, na.rm = T) ) -> Suz12_spec

Calculate the stoichiometry ratio relative to Suz12

Code
spec_counts |>
  full_join(Suz12_spec[, c("Sample", "Mean_Spec")], by = "Sample") |>
  group_by(Sample) |>
  mutate(Ratio = Norm_Spec / Mean_Spec) |>
  ungroup() |>
  group_by(UniProtID, Condition) |>
  mutate(Mean_Ratio = mean(Ratio, na.rm = T) ) |>
  mutate(Sd_Ratio = sd(Ratio, na.rm = T) ) -> spec_ratio

Group proteins by PRC2 core or subgroup and calculate p-values with t.test(). Remove the embryonic Aebp2 isoform (Q9Z248-3) with very very low PSMs from this plot for sake of simplicity.

Code
PRC2_data <- subset(spec_ratio, ProteinName %in% PRC2)
PRC2_data <- subset(spec_ratio, UniProtID != "Q9Z248-3")

# Subdivide by complex
PRC2_data |>
  mutate(Complex = case_when(ProteinName %in% core ~ "PRC2 Core",
                             ProteinName %in% PRC2.1 ~ "PRC2.1",
                             ProteinName %in% PRC2.2 ~ "PRC2.2",
                             )) |>
  relocate(Complex, .after = Condition) -> PRC2_data

# Set decreasing plotting order
PRC2_data |>
  subset(Condition == "WT") |>
  arrange(desc(Mean_Ratio)) |>
  pull(ProteinName) |> unique() -> genes_order

# Slight adjustment to have them in a miningful order 
genes_order <- factor(genes_order, levels = PRC2)

PRC2_data$ProteinName <- factor(PRC2_data$ProteinName, levels = PRC2)

## calculate significance

# Select genes with a mean ∆ stoichiometry ratio <= -0.1
PRC2_data |>
  group_by(ProteinName, Condition, Mean_Ratio) |>
  summarise() |>
  ungroup() |>
  group_by(ProteinName) |>
  mutate(Delta_Ratio = round(Mean_Ratio - Mean_Ratio[Condition == "WT"], 2)) |>
  relocate(Delta_Ratio, .after = Condition) |>
  arrange(Delta_Ratio) |>
  subset(abs(Delta_Ratio) >= 0.1) |>
  pull(ProteinName) -> genes_to_test

PRC2_data |>
  subset(ProteinName %in% genes_to_test) |>
  group_by(ProteinName) |>
  mutate(Pval = t.test(x = Ratio[Condition == "WT"], 
                            y = Ratio[Condition == "∆ex4"], 
                            exact = F, alternative = "two.sided")$p.value ) |>
  mutate(Pval = signif(Pval, 2) ) |>
  relocate(Pval, .after = Condition) |>
  mutate(Y_Pos = max(Mean_Ratio + Sd_Ratio + 0.1)) |>
  select(ProteinName, Condition, Complex, Pval, Mean_Ratio, Sd_Ratio, Y_Pos) |>
  unique() |>
  subset(Pval <= 0.05) -> signif_df

3.2.3 SUZ12 WT ∆ex4 stoichiometry plot

Plot figure 2F

Code
ggplot(PRC2_data) +
  aes(x = ProteinName, y = Ratio, fill = Condition) +
  facet_grid(~ Complex, scales = "free", space = 'free_x') +
  geom_vline(xintercept = 6.5, colour = "black") +
  geom_col(data = subset(PRC2_data, Sample %in% c("WTp", "dEX4#1" )), aes(y = Mean_Ratio),
           position = position_dodge(width = 0.75), width = 0.75,
           colour = "black", lwd = 0.125, show.legend = F) +
  geom_errorbar(position = position_dodge(width = 0.75), 
                aes(ymin = Mean_Ratio - Sd_Ratio, ymax = Mean_Ratio + Sd_Ratio),
                width = 0.3, lwd = 0.125)+
  geom_point(shape = 21,
             position = position_jitterdodge(dodge.width = 0.75, jitter.width = 0.4, jitter.height = 0),
             stroke = 0.16)  +
  geom_signif(data = signif_df, inherit.aes = F, 
              aes(xmin = ProteinName, xmax = ProteinName, y_position = Y_Pos, annotations = Pval ),
              textsize = 2, family = "Arial", lwd = 0.125, manual = TRUE) +
  # add axis vertical line to all facets.
  annotate("segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf, lwd = 0.125) +
  scale_fill_manual(values = c("WT" = "royalblue3", "∆ex4" = "goldenrod")) +
  scale_y_continuous(expand = expansion(add = 0, mult = c(0, 0.05)),
                     n.breaks = 6) +
  scale_x_discrete(expand = expansion(mult = c(0.05, 0), add = 0) ) +
  labs(y = "Stoichiometry") +
  coord_cartesian(clip = 'on', ylim = c(0, 1.3) )  +
  MSBar_plot_theme -> fig_2_barplot
fig_2_barplot

Save plot to pdf.

Code
ggsave(filename = "Fig2F_Stoichiometry_BarPlot.pdf", plot = fig_2_barplot, 
       path = pdf_dir_fig2, device = cairo_pdf,
       width = 8.75, height = 5.2, units = "cm")

3.2.4 Differencial enrichment analysis of WT and ∆exon 4 Suz12

How many high quality proteins are in this dataset?

Code
table(tidy_prot_wt_dEx4$swissprot)

  sp   tr 
1301  243 

Import metadata.

Code
Resce_WT_Metada <- read_excel(path = WT_Dex4_Metada_path)
Resce_WT_Metada$condition <- factor(Resce_WT_Metada$condition, 
                                    levels = c("WT", "Dexon4") )

Create a summarised Experiment with the PSMs. The function make_se takes the PSMs counts and transforms them in log2 scale. So that values that are 1 become zero and values that were zero PSMs become NA.

Code
data_unique <- make_unique(tidy_prot_wt_dEx4, ids = "UniProtID", names = "ProteinName") 
PSM_columns <- grep(pattern = paste(Resce_WT_Metada$label, collapse = "|"), colnames(data_unique))
data_se <- make_se(data_unique, PSM_columns, Resce_WT_Metada)
Code
plot_frequency(data_se)

Filter a bit for not very abundant proteins.

Code
data_filt <- filter_missval(data_se, thr = 0)

3.2.5 Normalise the data

Performs variance stabilizing transformation using vsnpackage (Huber et al. 2002).

Code
data_norm <- normalize_vsn(data_filt)

Verify normalisation with meanSdPlot()

Code
meanSdPlot(x = data_norm) 

Looks good. Next check data filtering and normalisation impact on the PSM values

Code
plot_normalization(data_se, data_filt, data_norm) +
  scale_fill_manual(values = c("WT" = "royalblue3", "Dexon4" = "goldenrod")) +
  theme(axis.text = element_text(color = 'black'),
        plot.background = element_blank(), 
        panel.background = element_blank())

3.2.6 Impute missing values

Check the missing values in the dataset.

Code
plot_missval(data_norm)

There aren’t a lot of missing values but for consistency with the other dataset I impute the missing proteins values anyway.

Code
plot_detect(data_filt)

Impute missing values with different methods and then pick one. The imputation is performed on the dataset where proteins missing too many values were already filtered out (see filter_missval above), and on the vsn normalised data (see normalize_vsn above). For details on imputation algorithms check ?MsCoreUtils::impute_matrix().

Code
# Impute missing data using random draws from a 
# Gaussian distribution centred around a minimal value (for MNAR: missing not at random )
data_imp_min <- DEP::impute(data_norm, fun = "MinProb", q = 0.01)
[1] 0.5854005
Code
# Impute missing data using random draws from a 
# manually defined left-shifted Gaussian distribution (for MNAR: missing not at random )
data_imp_man <- DEP::impute(data_norm, fun = "man", shift = 1.8, scale = 0.3)

# Impute missing data using the k-nearest neighbour approach (for MAR: missing at random)
data_imp_knn <- DEP::impute(data_norm, fun = "knn", rowmax = 0.9)

# Impute missing data using the Maximum likelihood-based imputation method using the EM algorithm (for MAR: missing at random)
data_imp_mle <- DEP::impute(data_norm, fun = "MLE")
Iterations of EM: 
1...2...3...4...5...6...7...
Code
## Add zero to missing data
data_imp_zer <- DEP::impute(data_norm, fun = "zero")

Here I also try to impute missing values in a tailor made fashion, using a mixed model as described in the advanced section of the DEP package vignette.

Here I consider a protein to have missing values NOT at random (MNAR) if it has missing values where the sum of the filtered, normalised PSMs across replicates is less than 1. Meaning that if a protein is detected as NA, NA, and 1 across 3 biological replicates of the ∆exon4 samples, the missing values (if any) will be set to the minumum value in the range (i.e. ~ -3).

Code
get_df_long(data_norm) |>
  group_by(name, condition) |>
  summarize(Sum_PSMs = sum(intensity, na.rm = T), .groups = 'keep' ) |>
  subset(condition == "Dexon4" )  |>
  summarise(NAs = Sum_PSMs <= 1, .groups = 'keep') |>
  subset(NAs)  |>
  pull(name) |> unique() -> proteins_MNAR

# Get a logical vector
MNAR <- names(data_norm) %in% proteins_MNAR

message("Identified: ", length(proteins_MNAR), " proteins missing NOT at random which will be min-imputed")
Identified: 128 proteins missing NOT at random which will be min-imputed
Code
# Perform a mixed imputation
data_imp_mixed <- DEP::impute(data_norm, 
                         fun = "mixed",
                         randna = !MNAR, # we have to define MAR which is the opposite of MNAR
                         mar = "knn", # imputation function for MAR
                         rowmax = 0.9,
                         mnar = "min") # imputation function for MNAR

Plot the protein intensity distribution of the original dataset with lots of missing values, the filtered, the normalised one, and the 3 different type of data imputation methods.

Code
plot_imputation(data_filt, data_norm, data_imp_mle, data_imp_knn, data_imp_mixed) +
  scale_colour_manual(values = c("WT" = "royalblue3", "Dexon4" = "goldenrod")) +
  theme(axis.text = element_text(color = 'black'),
        legend.position = "bottom",
        plot.background = element_blank(), 
        strip.background = element_blank(),
        panel.background = element_blank())

Between the mixed imputation and the knn imputation there’s barely any difference. Good, it means that there’re many missing values missing not at random.

Code
plot_normalization(data_norm, data_imp_knn, data_imp_mixed) +
  scale_fill_manual(values = c("WT" = "royalblue3", "Dexon4" = "goldenrod")) +
  theme(axis.text = element_text(color = 'black'),
        legend.position = "bottom",
        plot.background = element_blank(), 
        panel.background = element_blank())

Perform a PCA on the mixed imputed samples. As expected the samples major driver of diversity is the Suz12 antibody used (either homemade on the left or the commercial one on the right).

Code
plot_pca(data_imp_mixed, x = 1, y = 2, n = 500, point_size = 4) +
  scale_colour_manual(values = c("WT" = "royalblue3", "Dexon4" = "goldenrod")) +
  theme(plot.background = element_blank(), 
        plot.title = element_blank(),
        panel.background = element_blank()) 

Following the info in the DEP vignette I test which imputation method is the best.

The methods I tested are:

  • No imputation, but filtering and normalisation
  • MinProb imputation
  • man shifted Gaussian imputation
  • knn imputation
  • MLE imputation
  • zero imputation
  • Mixture of min for MNAR in Suz12 KO and knn for MAR

For this comparison only I use a significant threshold of 0.05 and a log2 Fold Change higher than |1.25|.

Code
# DE analysis on no, MinProb, man, knn, and MLE imputations
no_imputation_res <- DE_analysis(data_norm)
minprob_imputation_res <- DE_analysis(data_imp_min)
man_imputation_res <- DE_analysis(data_imp_man)
knn_imputation_res <- DE_analysis(data_imp_knn)
mle_imputation_res <- DE_analysis(data_imp_mle)
zer_imputation_res <- DE_analysis(data_imp_zer)
mix_imputation_res <- DE_analysis(data_imp_mixed)

Check how many differential enriched proteins there are in each dataset.

Code
# Number of significant proteins
objects <- c("no_imputation_res", "minprob_imputation_res", 
             "man_imputation_res", "knn_imputation_res", 
             "mle_imputation_res", "zer_imputation_res", 
             "mix_imputation_res")

kable(purrr::map_df(objects, DE_prots))
Dataset Num_Signif_Proteins Proteins_Names
no_imputation_res 5 Aebp2 / Esco2 / Ezh1 / Jarid2 / Kmt2b
minprob_imputation_res 9 Aebp2 / Atpb / Ddx46 / Esco2 / Ezh1 / Jarid2 / Kmt2b / Pwp3a / Utf1
man_imputation_res 5 Aebp2 / Ddx46 / Ezh1 / Jarid2 / Kmt2b
knn_imputation_res 8 Aebp2 / Atpb / Brx1 / Esco2 / Ezh1 / Jarid2 / Kmt2b / Pwp3a
mle_imputation_res 4 Ezh1 / Jarid2 / Kmt2b / Utf1
zer_imputation_res 8 Aebp2 / Atpb / Ddx46 / Esco2 / Ezh1 / Jarid2 / Kmt2b / Pwp3a
mix_imputation_res 6 Aebp2 / Atpb / Esco2 / Ezh1 / Jarid2 / Kmt2b

To further compare the results of the different imputation methods one could use ROC curves.

Code
# Get ROC data 
ROC_df <- purrr::map_df(objects, get_ROC_df)

Plot ROC

Code
ggplot(ROC_df, aes(FPR, TPR, col = method)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1)) +
  labs(title = "ROC-curve") +
  theme(axis.text = element_text(color = 'black'),
        plot.background = element_blank(), 
        strip.background = element_blank(),
        panel.background = element_blank())

Basically all methods are good and show no significant improvement over no imputation of missing values. However for consistency with the other dataset I decided to use the mixed imputation method and get the final result table for the volcano plot.

Consider significant everything with an FDR <= 0.05 and a log2 fold change more than 1.5.

Code
# Test every sample versus control
data_diff <- test_diff(data_imp_mixed, type = "control", control = "WT")
Tested contrasts: Dexon4_vs_WT
Code
# Denote significant proteins based on user defined cutoffs
dep <- add_rejections(data_diff, alpha = 0.05, lfc = 1.5 )

Plot sample correlation and see that replicates have very good correlation

Code
plot_cor(dep, significant = TRUE, lower = 0, upper = 1, pal = "PuOr") 

Check the intensity of the significant proteins found in the experiment

Code
plot_heatmap(dep, type = "centered", kmeans = F, 
             col_limit = 4, show_row_names = TRUE,
             indicate = c("condition", "replicate"))

Now plot the volcano plot results

Code
data_results <- get_results(dep)

Select proteins to label and set fill colour for Aebp2 and Jarid2

Code
prot_to_label <- c("Aebp2", "Jarid2", "Ezh1", "Kmt2b", "Atpb")

data_results <- mutate(data_results, 
                       Label = ifelse(name %in% prot_to_label, yes = TRUE, no = FALSE),
                       pretty_name = case_when(Label == TRUE ~ str_wrap(name, width = 10),
                                         Label == FALSE ~ name) ) |>
  mutate(Label_fill = case_when(name == "Aebp2" ~ "forestgreen",
                                name == "Jarid2" ~ "hotpink", 
                                !name %in% c("Aebp2", "Jarid2") ~ "white"))

3.2.7 Volcano plot ∆exon4 vs WT Suz12 IP

Plot figure 2E

Code
ggplot(data = data_results) +
  aes(x = Dexon4_vs_WT_ratio, y = -log10(Dexon4_vs_WT_p.val), 
      fill = Dexon4_vs_WT_significant ) +
  geom_point(shape = 21, show.legend = F, stroke = 0.16)  +
  labs(x = expression(log[2] ~ "Fold Change"), y = expression(-log[10] ~ "P-Value")) +
  coord_cartesian(xlim = c(-4.5, 3.0), ylim = c(0, 4.5),  clip = 'on') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.01), add = c(0.02, 0.04)),
                     n.breaks = 7) +
  scale_x_continuous(n.breaks = 7, expand = expansion(add = 0, mult = 0)) +
  Volcano_plot_theme -> p_basic_Volcano

p_Volcano1 <- p_basic_Volcano + 
  annotate(geom = "label", x = 2.25,  y = 0.35, label = "∆ex4", 
           colour = "black", fill = "goldenrod", size = 2,
           label.padding = grid::unit(0.5, "mm"), 
           label.r = unit(0.25, "mm"), family = "Arial",
           label.size = grid::unit(0.125, "mm") ) +
  annotate(geom = "label", x = -3.5, y = 0.35, label = "WT", 
           colour = "black",  fill = "royalblue3", size = 2, 
           label.padding = grid::unit(0.5, "mm"), 
           label.r = unit(0.25, "mm"), family = "Arial",
           label.size = grid::unit(0.125, "mm") ) +
  # points on the right side
  geom_label_repel(data = subset(data_results, Label & Dexon4_vs_WT_ratio > 0 ), 
                   aes(label = pretty_name, fill = Label_fill), 
                   seed = 16, show.legend = F, segment.curvature = -1e-20, 
                   family = "Arial", size = 2, nudge_x = -0.25,
                   segment.color = 'black',verbose = F, 
                   box.padding = grid::unit(1, "mm"),
                   point.padding = grid::unit(0.55, "mm"),
                   label.padding = grid::unit(0.5, "mm"), 
                   label.size = 0.125, 
                   max.overlaps = 20)  +
  # points on the left side
  geom_label_repel(data = subset(data_results, Label & Dexon4_vs_WT_ratio < 0), 
                 aes(label = pretty_name, fill = Label_fill), 
                 seed = 16, show.legend = F, segment.curvature = -1e-20, 
                 family = "Arial", size = 2, nudge_x = 0.25,
                 segment.color = 'black',verbose = F, 
                 box.padding = grid::unit(1, "mm"),
                 point.padding = grid::unit(0.55, "mm"),
                 label.padding = grid::unit(0.5, "mm"), 
                 label.size = 0.125, 
                 max.overlaps = 20)  +
  scale_fill_manual(values = c('gray84', "forestgreen", "hotpink", 'firebrick2', "white"))

p_Volcano1

Save to pdf.

Code
ggsave(filename = "Fig2E_dExon4_vs_WT_Volcano.pdf", plot = p_Volcano1,
       path = pdf_dir_fig2, device = cairo_pdf, units = "cm",
       width = 4.2, height = 5.25)

Make it interactive.

Code
p_interactive_Volcano <- p_basic_Volcano +
  geom_point(shape = 21, show.legend = F, aes(text = paste0(name)) ) +
  scale_fill_manual(values = c("TRUE" = 'firebrick2', "FALSE" = 'gray84') ) +
  labs(x = "log2 Fold Change", y = "-log10 P-Value")

Explore the volcano plot interactively.

Code
ggplotly(p_interactive_Volcano)

Since I notice that Ezh1 and Kmt2b are significantly enriched but have a log2 Fold Change lower than 2 I explore their significance in the light of the fact that they are very low PSMs. To address this I make an MA plot like it’s done for RNA-seq.

Code
assay(data_imp_mixed) |>
  as.data.frame() |>
  rownames_to_column("name") |>
  pivot_longer(cols = !c("name"), names_to = "Sample", values_to = "Norm_PSMs") |>
  mutate(Condition = case_when( grepl(x = Sample, pattern = "WT") ~ "WT",
                                grepl(x = Sample, pattern = "Dexon4") ~ "∆ex4") ) |>
  relocate(Condition, .after = Sample) |>
  mutate(Condition = factor(Condition, levels = c("WT", "∆ex4") )) |>
  group_by(name) |>
  mutate(Mean_PSM = mean(Norm_PSMs, na.rm = T)) |>
  select(name, Mean_PSM) |>
  unique() |>
  right_join(data_results, by = "name") |>
  relocate(ID, .after = name) -> res_WT_Dexon4

Check the PSMs of 2 proteins: EZH1 and KMT2B.

Plot Mean PSMs vs log2 Fold Change.

Code
res_WT_Dexon4 |>
mutate(Label = ifelse(name %in% prot_to_label, yes = TRUE, no = FALSE),
       pretty_name = case_when(Label == TRUE ~ str_wrap(name, width = 10), Label == FALSE ~ name) ) |>
  mutate(Label_fill = case_when(name == "Aebp2" ~ "forestgreen",
                                name == "Jarid2" ~ "hotpink", 
                                !name %in% c("Aebp2", "Jarid2") ~ "white")) -> res_WT_Dexon4

ggplot(res_WT_Dexon4) +
  aes(x = Mean_PSM, y = Dexon4_vs_WT_ratio, fill = Dexon4_vs_WT_significant ) +
  geom_point(shape = 21, show.legend = F, stroke = 0.16)  +
  geom_hline(yintercept = 0, colour = "black") +
  labs(x = expression(log[2] ~ "Mean PSM"), y = expression(-log[2] ~ "Fold Change")) +
  scale_y_continuous(n.breaks = 7) +
  scale_x_continuous(n.breaks = 7) -> p_basic_MA_plot

p_basic_MA_plot +
  # points on top
  geom_label_repel(data = subset(res_WT_Dexon4, Label & Dexon4_vs_WT_ratio > 0 ), 
                   aes(label = pretty_name, fill = Label_fill), 
                   seed = 16, show.legend = F, segment.curvature = -1e-20, 
                   family = "Arial", size = 2, nudge_x = -0.25,
                   segment.color = 'black',verbose = F, 
                   box.padding = grid::unit(1, "mm"),
                   point.padding = grid::unit(0.55, "mm"),
                   label.padding = grid::unit(0.5, "mm"), 
                   label.size = 0.125, 
                   max.overlaps = 20)  +
  # points on the bottom
  geom_label_repel(data = subset(res_WT_Dexon4, Label & Dexon4_vs_WT_ratio < 0), 
                 aes(label = pretty_name, fill = Label_fill), 
                 seed = 16, show.legend = F, segment.curvature = -1e-20, 
                 family = "Arial", size = 2, nudge_x = 0.25,
                 segment.color = 'black',verbose = F, 
                 box.padding = grid::unit(1, "mm"),
                 point.padding = grid::unit(0.55, "mm"),
                 label.padding = grid::unit(0.5, "mm"), 
                 label.size = 0.125, 
                 max.overlaps = 20) +

  scale_fill_manual(values = c('gray84', "forestgreen", "hotpink", 'firebrick2', "white")) +
  Volcano_plot_theme -> p_MA_plot_WT_Dexon4
p_MA_plot_WT_Dexon4

As one can see in fact both Ezh1 and Kmt2b (and Esco2) have very low PSMs (below 1). This means that they could be actually differentially bond to Suz12 WT and Suz12 ∆exon4 however their signal is so low it is hard to draw conclusions form it.

Save to pdf.

Code
ggsave(filename = "MAplot_dExon4_vs_WT.pdf", plot = p_MA_plot_WT_Dexon4,
       device = cairo_pdf, path = pdf_dir_fig2, width = 4.5, height = 5.25, units = "cm")
Code
p_interactive_MA <- p_basic_MA_plot +
  geom_point(shape = 21, show.legend = F, aes(text = paste0(name)) ) +
  scale_fill_manual(values = c("TRUE" = 'firebrick2', "FALSE" = 'gray84') ) +
  labs(x = "log2 Mean PSMs", y = "log2 Fold Change") +
  Volcano_plot_theme 

I also made an interactive MA plot if one is interested in exploring more the proteins.

Code
ggplotly(p_interactive_MA)

Now analyse the other dataset.

Code
rm(tidy_prot_wt_dEx4, PRC2_data, signif_df, spec_counts, spec_ratio, PRC2_prot_wt_dEx4, fig_2_barplot)

4 Supplementary Figure Panels

4.1 Rescue and Suz12 KO Flag IPs

Now perform the same analysis but on the other dataset.

4.1.1 Import PSMs from mzTab file for the rescue dataset

Code
prot_Res_KO <- readMzTabData(file = Resce_KO_mzTAB_path, what = "PRT", version = "1.0")

# tidy up the data
exprs(prot_Res_KO) |>
  as.data.frame() |>
  rownames_to_column(var = "info") |>
  setNames(c("info", "SKOrL_1", "SKOrL_2", "SKOrL_3",
            "SKOrS_1", "SKOrS_2", "SKOrS_3", "SKO_1", "SKO_2", "SKO_3") ) |>
  tidyr::drop_na() |> 
  mutate(swissprot = str_extract(pattern = "^[s-t][p|r]", string = info)) |>
  # extract the UniProt ID while preserving alternative isoform ID (dashed number)
  mutate(UniProtID = str_extract(pattern = "(?<=^[s-t][p|r].)[A0-Z9](.*?)\\.([1-9]?)", string = info)) |> 
  mutate(UniProtID = sub(pattern = "\\.", replacement = "-", x = UniProtID)) |>
  mutate(UniProtID = sub(pattern = "-$", replacement = "", x = UniProtID)) |>
  # Extract species name
  mutate(Species = str_extract(pattern = "_[A0-Z9](.*?)(?=\\.1)", string = info)) |>
  mutate(Species = sub(pattern = "^_", replacement = "", x = Species)) |>
  # Not really the protein name, it's an abbreviation used by uniprot
  mutate(ProteinName = str_extract(pattern = "(?<=^[s-t][p|r]\\.[A0-Z9].....).*?\\.[A0-Z9](.*?)(?=_)",
                                   string = info)) |>
  mutate(ProteinName = str_remove(string = ProteinName, pattern = "(.*?)\\.")) |>
  mutate(ProteinName = str_remove(string = ProteinName, pattern = "([1-9]?)\\.")) |>
  mutate(ProteinName = gsub(pattern = "\\.", "-", x = ProteinName )) |>
  mutate(ProteinName = str_to_title(string = ProteinName) ) -> tidy_prot_rescue_KO

Here I manually fix 2 protein names like in the previous dataset. First Jarid2 is called Jard2, this is because it comes from JARD2_MOUSE name used by UniProt. The second is the protein called EED which in this experiment is mapped to the not so well annotated protein with ID A0A5F8MPX8.

Code
tidy_prot_rescue_KO |>
  # Fix Jarid2 name that comes from JARD2_MOUSE
  mutate(ProteinName = sub(pattern = "Jard2", replacement = "Jarid2", x = ProteinName)) |>
  mutate(ProteinName = ifelse(test = UniProtID == "A0A5F8MPX8", yes = "Eed", no = ProteinName)) |>
  relocate(swissprot, UniProtID, Species, ProteinName, .after = info) -> tidy_prot_rescue_KO

4.1.2 Calculate a stoichiometry score for PRC2 proteins in the rescue dataset

Subset for these proteins.

Code
PRC2_prot_rescue_KO <- subset(tidy_prot_rescue_KO, ProteinName %in% PRC2)

Use the UniprotR package to query UniProt and get protein length. This could be a bit time consuming.

Code
PRC2_prot_rescue_KO$Protein_Length <- UniprotR::GetSequences(ProteinAccList = PRC2_prot_rescue_KO$UniProtID)$Length

Reshape the data like before.

Code
PRC2_prot_rescue_KO |>
  pivot_longer(cols = c(starts_with("SKO") ),
               names_to = "Sample", values_to = "Spec") |>
  mutate(Condition = case_when( grepl(x = Sample, pattern = "SKO_") ~ "KO",
                                grepl(x = Sample, pattern = "SKOrS_") ~ "KOrS",
                                grepl(x = Sample, pattern = "SKOrL_") ~ "KOrL") ) |>
  relocate(Condition, .after = Sample) |>
  mutate(Condition = factor(Condition, levels = c("KOrL", "KOrS", "KO") )) |>
  mutate(Spec = as.integer(Spec),
         Norm_Spec = Spec / Protein_Length) -> spec_counts

Use Suz12 has a normalising factor.

Code
Su12_spec <- subset(spec_counts, ProteinName == "Suz12")
colnames(Su12_spec)[colnames(Su12_spec) == "Norm_Spec"] <- "Bait_Norm_Spec"
Su12_spec |> 
  group_by(Condition) |>
  mutate(Mean_Spec = mean(Bait_Norm_Spec, na.rm = T) ) -> Suz12_spec

Calculate the stoichiometry ratio relative to Suz12.

Code
spec_counts |>
  full_join(Suz12_spec[, c("Sample", "Mean_Spec")], by = "Sample") |>
  group_by(Sample) |>
  mutate(Ratio = Norm_Spec / Mean_Spec) |>
  ungroup() |>
  group_by(UniProtID, Condition) |>
  mutate(Mean_Ratio = mean(Ratio, na.rm = T) ) |>
  mutate(Sd_Ratio = sd(Ratio, na.rm = T) ) -> spec_ratio

Group proteins by PRC2 core or subgroup and calculate p-values with t.test(). Remove the embryonic Aebp2 isoform (Q9Z248-3) with very very low PSMs from this plot for sake of simplicity and also an Ezh1 variant (P70351-2).

Code
PRC2_data <- subset(spec_ratio, ProteinName %in% PRC2)
PRC2_data <- subset(PRC2_data, Condition != "KO")
PRC2_data <- subset(PRC2_data, UniProtID != "Q9Z248-3")
PRC2_data <- subset(PRC2_data, UniProtID != "P70351-2")

# Subdivide by complex
PRC2_data |>
  mutate(Complex = case_when(ProteinName %in% core ~ "PRC2 Core",
                             ProteinName %in% PRC2.1 ~ "PRC2.1",
                             ProteinName %in% PRC2.2 ~ "PRC2.2",
                             )) |>
  relocate(Complex, .after = Condition) -> PRC2_data

# Set decreasing plotting order as in previous plot
PRC2_data$ProteinName <- factor(PRC2_data$ProteinName, levels = genes_order)

## calculate significance

# Select genes with a mean ∆ stoichiometry ratio <= -0.1
PRC2_data |>
  group_by(ProteinName, Condition, Mean_Ratio) |>
  summarise() |>
  ungroup() |>
  group_by(ProteinName) |>
  mutate(Delta_Ratio = round(Mean_Ratio - Mean_Ratio[Condition == "KOrL"], 2)) |>
  relocate(Delta_Ratio, .after = Condition) |>
  arrange(Delta_Ratio) |>
  subset( !between(Delta_Ratio, left = -0.01, right = 0.01)) |>
  pull(ProteinName) -> genes_to_test

genes_to_test <- c("Ezh1", "Eloc", "Epop", "Phf19", "Jarid2", "Aebp2", "Ezhip")

PRC2_data |>
  subset(ProteinName %in% genes_to_test) |>
  group_by(ProteinName) |>
  mutate(Pval = t.test(x = Ratio[Condition == "KOrL"], 
                            y = Ratio[Condition == "KOrS"], 
                            exact = F, alternative = "two.sided")$p.value ) |>
  mutate(Pval = signif(Pval, 2) ) |>
  relocate(Pval, .after = Condition) |>
  mutate(Y_Pos = max(Mean_Ratio+ Sd_Ratio + 0.1)) |>
  select(ProteinName, Condition, Complex, Pval, Mean_Ratio, Sd_Ratio, Y_Pos) |>
  unique() |>
  subset(Pval <= 0.12) -> signif_df

4.1.3 SUZ12 KO and Rescues stoichiometry plot

For supplementary figure S2M

Code
ggplot(PRC2_data) +
  aes(x = ProteinName, y = Ratio, fill = Condition) +
  facet_grid(~ Complex, scales = "free", space = 'free_x') +
  geom_vline(xintercept = 6.5, colour = "black") +
  geom_col(data = subset(PRC2_data, Sample %in% c("SKOrS_1", "SKOrL_1")), 
           aes(y = Mean_Ratio),
           position = position_dodge(width = 0.75), width = 0.75,
           colour = "black", lwd = 0.125, show.legend = F) +
  geom_errorbar(position = position_dodge(width = 0.75), 
                aes(ymin = Mean_Ratio - Sd_Ratio, ymax = Mean_Ratio + Sd_Ratio),
                width = 0.3, lwd = 0.125)+
  geom_point(shape = 21, 
             position = position_jitterdodge(dodge.width = 0.75, jitter.width = 0.4, jitter.height = 0), 
             stroke = 0.16)  +
  geom_signif(data = signif_df, inherit.aes = F,
              aes(xmin = ProteinName, xmax = ProteinName, y_position = Y_Pos, annotations = Pval ),
             textsize = 2, family = "Arial", lwd = 0.125, manual = TRUE) +
  # add axis vertical line to all facets.
  annotate("segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf, lwd = 0.125) +
  scale_fill_manual(values = c("KOrL" = "mediumpurple3", "KOrS" = "darkorange1")) +
  scale_y_continuous(expand = expansion(add = 0, mult = c(0, 0.05)),
                     n.breaks = 6) +
  scale_x_discrete(expand = expansion(mult = c(0.05, 0), add = 0) ) +
  labs(y = "Stoichiometry") +
  coord_cartesian(clip = 'on', ylim = c(0, 1.4))  +
  MSBar_plot_theme -> fig_S2_barplot_rescues
fig_S2_barplot_rescues

Save plot to pdf.

Code
ggsave(filename = "FigS2M_Stoichiometry_BarPlot.pdf", 
       path = pdf_dir_fig2, plot = fig_S2_barplot_rescues, device = cairo_pdf,
       units = "cm", width = 9.5, height = 5.2, )

4.1.4 Differencial enrichment analysis of Suz12 KO and Long and Short Suz12 rescues

Import metadata

Code
Resce_KO_Metada <- read_excel(path = Resce_KO_Metada_path)
Resce_KO_Metada$condition <- factor(Resce_KO_Metada$condition, 
                                    levels = c("KOrL", "KOrS", "KO") )

Create a summarised Experiment with the PSMs.

Code
data_unique <- make_unique(tidy_prot_rescue_KO, ids = "UniProtID", names = "ProteinName") 
PSM_columns <- grep(pattern = paste(Resce_KO_Metada$label, collapse = "|"), colnames(data_unique))
data_se <- make_se(data_unique, PSM_columns, Resce_KO_Metada)

Now I can use some of the plotting functions of DEP to explore the data.

Code
plot_frequency(data_se)

Test different filtering thresholds. The function filter_missval filters the dataset for proteins that have a maximum of thr missing values in at least one condition. If there are 3 biological replicates thr = 0 means to keep only the proteins that are identified in all 3 replicates and thr = 1 keeps all the proteins that are identified in at least 2 samples of each biological condition. I’ll stick to thr = 0 but I tried less stringent filter with little difference.

Code
data_filt <- filter_missval(data_se, thr = 0)
data_filt2 <- filter_missval(data_se, thr = 1)
data_filt3 <- filter_proteins(data_se, "fraction", min = 0.6)

Now the number of proteins identified is different

Code
plot_frequency(data_filt)

I can also check the number of proteins identified in each sample. For sure Suz12 KO rescued with Long Suz12 has the highest number of samples. And in general replicates have similar number of identified proteins.

Code
plot_numbers(data_filt) +
  scale_fill_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  scale_y_continuous(n.breaks = 10)

Explore the data by dimensionality reduction.

Code
plot_pca(data_filt, x = 1, y = 2, n = 300, point_size = 4) +
  scale_colour_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  theme(plot.background = element_blank(), 
        plot.title = element_blank(),
        panel.background = element_blank()) 

Sample separate well between each other.

4.1.5 Normalise the data

Performs variance stabilizing transformation using vsnpackage. I do this on the dataset with filtered missing values default threshold (thr = 0).

Code
data_norm <- normalize_vsn(data_filt)
Code
meanSdPlot(x = data_norm, ranks = T)

Check data filtering and normalisation impact on the PSM values

Code
plot_normalization(data_se, data_filt, data_norm) +
  scale_fill_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  theme(axis.text = element_text(color = 'black'),
        plot.background = element_blank(), 
        panel.background = element_blank())

I think vsn normalisation is increasing too much the intensities in the Suz12 KO samples (grey).

4.1.6 Impute missing values

Check the missing values in the dataset.

Code
plot_missval(data_norm)

As expected in MS proteomics experiments there are a lot of missing values, and this missing proteins are generally of low intensity. Here in this dataset, the Suz12 KO samples have most of the missing values and Suz12KO rescued with Short isoform to a certain extent. I believe this is because most of the proteins identified in the Suz12 KO are unspecific background proteins. This makes sense ad they’re used as control and do not pull down Suz12.

Code
plot_detect(data_filt)

Impute missing values with different methods and then pick one. The imputation is performed on the dataset where proteins missing too many values were already filtered out (see filter_missval above), and on the vsn normalised data (see normalize_vsn above). For details on imputation algorithms check ?MsCoreUtils::impute_matrix().

Code
# Impute missing data using random draws from a 
# Gaussian distribution centred around a minimal value (for MNAR: missing not at random )
data_imp_min <- DEP::impute(data_norm, fun = "MinProb", q = 0.01)
[1] 1.466768
Code
# Impute missing data using random draws from a 
# manually defined left-shifted Gaussian distribution (for MNAR: missing not at random )
data_imp_man <- DEP::impute(data_norm, fun = "man", shift = 1.8, scale = 0.3)

# Impute missing data using the k-nearest neighbour approach (for MAR: missing at random)
data_imp_knn <- DEP::impute(data_norm, fun = "knn", rowmax = 0.9)

# Impute missing data using the Maximum likelihood-based imputation method using the EM algorithm (for MAR: missing at random)
data_imp_mle <- DEP::impute(data_norm, fun = "MLE")
Iterations of EM: 
1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...21...22...23...24...25...26...27...28...29...30...31...32...33...34...35...36...37...38...39...40...41...42...43...44...45...46...47...48...49...
Code
## Add zero to missing data
data_imp_zer <- DEP::impute(data_norm, fun = "zero")

Here I also try to impute missing values in a tailor made fashion, using a mixed model as described in the advanced section of the DEP package vignette.

Here I consider a protein to have missing values NOT at random (MNAR) if it has missing values in all replicates of Suz12 KO condition or if the sum of the filtered, not-normalised PSMs across replicates is less than 1. Meaning if a protein is detected as NA, NA, and 1 in 3 biological replicates of Suz12 KO, the missing values to be set to the minumum value in the range (i.e. ~ -3).

Code
get_df_long(data_filt) |>
  group_by(name, condition) |>
  summarize(Sum_PSMs = sum(intensity, na.rm = T), .groups = 'keep' ) |>
  subset(condition == "KO") |>
  summarise(NAs = Sum_PSMs <= 1, .groups = 'keep') |>
  subset(NAs) |>
  pull(name) |> unique() -> proteins_MNAR

message("Identified: ", length(proteins_MNAR), " proteins missing NOT at random which will be zero-imputed")
Identified: 665 proteins missing NOT at random which will be zero-imputed
Code
# Get a logical vector
MNAR <- names(data_norm) %in% proteins_MNAR

# Perform a mixed imputation
data_imp_mixed <- DEP::impute(data_norm, 
                         fun = "mixed",
                         randna = !MNAR, # we have to define MAR which is the opposite of MNAR
                         mar = "knn", # imputation function for MAR
                         rowmax = 0.9,
                         mnar = "min") # imputation function for MNAR

Plot the protein intensity distribution of the original dataset with lots of missing values, the filtered, the normalised one, and the 3 different type of data imputation methods.

Code
plot_imputation(data_se, data_filt, data_imp_mle, data_imp_knn, data_imp_mixed) +
  scale_colour_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  theme(axis.text = element_text(color = 'black'),
        legend.position = "bottom",
        plot.background = element_blank(), 
        strip.background = element_blank(),
        panel.background = element_blank())

One can see how there’s a zero-inflation for Suz12 KO intensities in the mixed model.

Code
plot_normalization(data_norm, data_imp_knn, data_imp_mixed) +
  scale_fill_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  theme(axis.text = element_text(color = 'black'),
        legend.position = "bottom",
        plot.background = element_blank(), 
        panel.background = element_blank())

Do a dimensionality reduction just to check that the imputation still preserves good sample relationships.

Code
plot_pca(data_imp_mixed, x = 1, y = 2, n = 500, point_size = 4) +
  scale_colour_manual(values = c("KOrS" = "darkorange1", "KOrL" = "mediumpurple3", "KO" = "gray84")) +
  theme(plot.background = element_blank(), 
        plot.title = element_blank(),
        panel.background = element_blank()) 

Following the info in the DEP vignette I test which imputation method is the best.

The methods I tested are:

  • No imputation, but filtering and normalisation
  • MinProb imputation
  • man shifted Gaussian imputation
  • knn imputation
  • MLE imputation
  • zero imputation
  • Mixture of min for MNAR in Suz12 KO and knn for MAR

For this comparison only I use a significant threshold of 0.05 and |1.5| log2 Fold change.

Code
# DE analysis on no, MinProb, man, knn, and MLE imputations
no_imputation_res <- DE_analysis_KOrL_KOrS(data_norm)
minprob_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_min)
man_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_man)
knn_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_knn)
mle_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_mle)
zer_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_zer)
mix_imputation_res <- DE_analysis_KOrL_KOrS(data_imp_mixed)

Check how many differential enriched proteins there are in each dataset

Code
# Number of significant proteins
objects <- c("no_imputation_res", "minprob_imputation_res", 
             "man_imputation_res", "knn_imputation_res", 
             "mle_imputation_res", "zer_imputation_res", 
             "mix_imputation_res")

kable(purrr::map_df(objects, DE_prots))
Dataset Num_Signif_Proteins Proteins_Names
no_imputation_res 60 A0a0n4sv80 / A0a0r4j0j6 / A0a2r8vhf9 / Addg / Anx11 / Assy / Cdc23 / Cdk12 / Ck5p2 / Cnot1 / Cor2b / Cpsf5 / Cpsf6 / Cwc15 / D3ywx2 / Ddx10 / Ddx54 / Dhb12 / Dyhc1 / F8wi22 / G3uys8 / Gptc4 / Hp1b3 / Int9 / Kanl2 / Lyric / Nipbl / Nom1 / Nsun5 / Patz1 / Prp31 / Prsr1 / Qser1 / Rad21 / Rbm7 / Rm13 / Rm54 / Rpa34 / Rpb4 / Rprd2 / Rs15a / Ruxg / S30bp / S4r1w5 / Sf01 / Sgo2 / Smc6 / Stag2 / Syf2 / Tcpg / Tefm / Thoc1 / Thoc7 / Tim50 / Tm10c / Ubp48 / Zc3he / Zmym3 / Zn296 / Zn326
minprob_imputation_res 37 A0a0n4sv80 / A0a2r8vhf9 / Addg / Aebp2.1 / Assy / Bud13 / Clk1 / Cnot1 / Cpsf5 / D3ywx2 / Dhb12 / Dyhc1 / F8wi22 / Hp1b3 / Kanl2 / Myo1e / Ncbp1 / Nsun5 / Prsr1 / Qser1 / Rad21 / Rm02 / Rm13 / Rprd2 / S30bp / S4r1w5 / Sf01 / Sgo2 / Smc6 / Smrd2 / Tefm / Thoc7 / Ubp48 / Zc3he / Zmym3 / Zn296 / Zn326
man_imputation_res 77 A0a0a6yvp4 / A0a0g2jf16 / A0a0n4sv80 / A0a1b0gt89 / A0a2r8vhf9 / Addg / Aebp2.1 / Ari5b / B7zcr6 / Bcla3 / Brd2 / Bud13 / Cdc23 / Cdk12 / Chm2a / Clk1 / Cor2b / Cpsf5 / Cpsf6 / Cwc15 / D3ywx2 / Ddx54 / Dhb12 / Dyhc1 / Epc1 / Exosx / G3uys8 / Gptc4 / Hira.1 / Hp1b3 / Int12 / Int9 / Kanl2 / Myl6b / Myo1e / Ncbp1 / Ncbp2 / Nip7 / Nom1 / Nop10 / Nsun5 / P5cr2 / Patz1 / Pp1ra / Ppil3 / Prp31 / Prp4 / Prsr1 / Qser1 / Rad21 / Rbm5 / Rbm7 / Requ / Rm13 / Rpa34 / Rpb11 / Rpb4 / Rpb7 / Rprd2 / Ruxg / S30bp / S4r1w5 / Sf01 / Sgo2 / Smc6 / Smrd2 / Stag2 / Syf2 / Tefm / Thoc7 / Tm10c / Yets2 / Zmym3 / Zn106 / Zn296 / Zn326 / Zn431
knn_imputation_res 25 A0a0n4sv80 / A0a2r8vhf9 / Addg / Ck5p2 / Cpsf5 / D3ywx2 / Ddx10 / Dhb12 / Dyhc1 / Kanl2 / Prp31 / Prsr1 / Rad21 / Rpa34 / Rpb4 / Rprd2 / Sgo2 / Smc6 / Stag2 / Thoc7 / Ubp48 / Zc3he / Zmym3 / Zn296 / Zn326
mle_imputation_res 12 A0a0n4sv80 / A0a2r8vhf9 / Addg / Cpsf6 / D3ywx2 / Dyhc1 / Prp31 / Rpa34 / Rpb4 / Rprd2 / S30bp / Smc6
zer_imputation_res 20 Ap2s1 / Ap3b1 / Cenpa / Ck5p2 / Cor2b / Ddx10 / Dhb12 / Dx39a / Dyhc1 / Hnrpl / In80c / Ncbp1 / Ost48 / Pacn3 / Q6p5b5 / Resf1 / Roa2.1 / Rpb4 / V9gxb6 / Ylpm1
mix_imputation_res 89 A0a0a6yvp4 / A0a0g2jf16 / A0a0n4sv80 / A0a1b0gt89 / A0a494bap7 / Addg / Aebp2.1 / Anx11 / Apc7 / Ari5b / B7zcr6 / Brd2 / Bud13 / Cdc23 / Cdk12 / Cdk13 / Chm2a / Cnot1 / Cor2b / Cpsf6 / Cwc15 / D3ywx2 / Ddx54 / Dhb12 / Dyhc1 / Epc1 / Exosx / F8wi22 / G3uys8 / Gptc4 / Hira.1 / Hnrpc.2 / Hp1b3 / Int10 / Int4 / Int9 / Kanl2 / Knop1 / Mcat / Ncbp1 / Ncbp2 / Nip7 / Nipbl / Nom1 / Nop10 / Nsun5 / P5cr2 / Patz1 / Phf11 / Pphln / Ppil3 / Prp4 / Prsr1 / Q3tjb1 / Qser1 / Rad21 / Rbm5 / Rbm7 / Requ / Rm02 / Rm13 / Rm15 / Rm23 / Rpa34 / Rpb11 / Rpb4 / Rpb7 / Rprd2 / Ruxg / S30bp / S4r1w5 / Sf01 / Sgo2 / Smc6 / Smrd2 / Stag2 / Syf2 / Tdif2 / Tefm / Thoc7 / Tim50 / Tm10c / Toprs / Yets2 / Zmym3 / Zn106 / Zn296 / Zn326 / Zn431

To further compare the results of the different imputation methods one could use ROC curves.

Code
# Function to obtain ROC data
get_ROC_df <- function(results) {
  get(results) |>
  select(name, KOrL_vs_KOrS_p.val, significant) |>
  mutate(
    DE = grepl(T, significant),
    BG = grepl(F, significant)) |>
  arrange(KOrL_vs_KOrS_p.val) -> tmp
  
  mutate(tmp,
      TPR = cumsum(as.numeric(DE)) / length(which(tmp$DE)),
      FPR = cumsum(as.numeric(BG)) / length(which(tmp$BG)),
      method = results) -> tmp
  return(tmp)

}
# Get ROC data 
ROC_df <- purrr::map_df(objects, get_ROC_df)

Plot ROC

Code
ggplot(ROC_df, aes(FPR, TPR, col = method)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1)) +
  labs(title = "ROC-curve") +
  theme(axis.text = element_text(color = 'black'),
        plot.background = element_blank(), 
        strip.background = element_blank(),
        panel.background = element_blank())

Basically all methods are good and better than no imputation. So I settled on using the mixed imputation method and get the final result table for the volcano plot.

4.1.7 Test against Suz12 KO as control

Here the SUZ12 KO acts as a control.

Code
data_imp_mixed |>
  test_diff(type = "control", control = "KO") |>
  add_rejections(alpha = 0.05) -> res_KO

Volcano plot: KO+SUZ12-Short

Code
plot_volcano(dep = res_KO, contrast = "KOrS_vs_KO") +
  labs(title = 'KO+S vs KO')

Volcano plot: KO+SUZ12-Long

Code
plot_volcano(dep = res_KO, contrast = "KOrL_vs_KO") +
  labs(title = 'KO+L vs KO')

Remove from the dataset the protein hits that are significant in the Suz12 KO pull downs.

Code
get_results(res_KO) |>
  subset(KOrL_vs_KO_ratio < -0.5) |>
  subset(KOrL_vs_KO_p.adj <= 0.05) |> pull(ID) -> KO_hits_vs_Long

get_results(res_KO) |>
  subset(KOrS_vs_KO_ratio < -0.5) |>
  subset(KOrS_vs_KO_p.adj <= 0.05) |> pull(ID) -> KO_hits_vs_Long

KO_hits <- unique(c(KO_hits_vs_Long, KO_hits_vs_Long))

message("Found : ", length(KO_hits), " proteins that are significantly enriched in the Suz12 KO. I'll discard them as they're unspecific")
Found : 8 proteins that are significantly enriched in the Suz12 KO. I'll discard them as they're unspecific

4.1.8 Test KO+SUZ12-S vs KO+SUZ12-L

Here the KO+SUZ12-L acts as a control.

Code
data_imp_mixed |>
  test_diff(type = "control", control = "KOrL") |>
  add_rejections(alpha = 0.05) |>
  plot_volcano(contrast = "KOrS_vs_KOrL") +
  labs(title = 'KO+S vs KO+L')
Tested contrasts: KOrS_vs_KOrL, KO_vs_KOrL

Consider significant everything with an FDR <= 0.05 and a log2 fold change more than 1.25. Use KO+L as control in the contrast between KO+S vs KO+L.

Code
data_imp_mixed |>
  test_diff(type = "control", control = "KOrL") |>
  add_rejections(alpha = 0.05, lfc = log2(1.25)) |>
  get_results() |>
  select(name, ID, starts_with('KOrS_')) -> data_results_LvsS
Tested contrasts: KOrS_vs_KOrL, KO_vs_KOrL

Get the results when using KO as control

Code
data_imp_mixed |>
  test_diff(type = "control", control = "KO") |>
  add_rejections(alpha = 0.05, lfc = log2(1.25)) |>
  get_results() |>
  select(name, ID, starts_with('KOrS_'), starts_with('KOrL_'), KO_centered ) -> data_results_LnSvsKO
Tested contrasts: KOrL_vs_KO, KOrS_vs_KO

Combine both results

Code
data_results <- full_join(data_results_LvsS, data_results_LnSvsKO, 
                          by = join_by(name, ID, KOrS_centered))

Remove hits that are significant in KO.

Code
# remove KO hits
data_results <- subset(data_results, ! ID %in% KO_hits)

4.1.9 KO+S vs KO+L Volcano plot

Rename the 2 isoforms of Aebp2 to match what is used in the literature

Code
data_results |>
  mutate(name = ifelse(test = name == "Aebp2.1", yes = "Embryo Aebp2", no = name)) |>
  mutate(name = ifelse(test = name == "Aebp2", yes = "Somatic Aebp2", no = name)) |>
  # actually simplify and label the embryonic Aebp2 just as "Aepb2"
  mutate(name = ifelse(test = name == "Embryo Aebp2", yes = "Aebp2", no = name)) |>
  mutate(name = ifelse(test = name == "Jard2", yes = "Jarid2", no = name) ) -> data_results

Select proteins to label and set fill colour for Aebp2 and Jarid2

Code
prot_to_label <- c("Aebp2", "Jarid2", "Stag2", "Dhb12", "Phf11", "Rpb4", "Rprd2",
                   "Chm2a", "Nsun5", "Zn296", "Ari5b", "Ezhip")

data_results <- mutate(data_results, 
                       Label = ifelse(name %in% prot_to_label, yes = TRUE, no = FALSE),
                       pretty_name = case_when(Label == TRUE ~ str_wrap(name, width = 10),
                                         Label == FALSE ~ name) ) |>
  mutate(Label_fill = case_when(name == "Aebp2" ~ "forestgreen",
                                name == "Jarid2" ~ "hotpink", 
                                !name %in% c("Aebp2", "Jard2") ~ "white"))

Make the volcano plot

Code
ggplot(data = data_results) +
  aes(x = KOrS_vs_KOrL_ratio, y = -log10(KOrS_vs_KOrL_p.val), 
      fill = KOrS_vs_KOrL_significant ) +
  geom_point(shape = 21, show.legend = F, stroke = 0.16)  +
  labs(x = expression(log[2] ~ "Fold Change"), y = expression(-log[10] ~ "P-Value")) +
  coord_cartesian(xlim = c(-7.5, 7.5), ylim = c(0, 8.0), clip = 'on') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.01), add = c(0.02, 0.04)),
                     n.breaks = 7) +
  scale_x_continuous(n.breaks = 10, expand = expansion(mult = 0, add = 0), oob = scales::oob_squish ) +
  Volcano_plot_theme -> p_basic_Volcano

p_Volcano2 <- p_basic_Volcano + 
  annotate(geom = "label", x = 5.0,  y = 0.35, label = "KO+Suz12S", 
           colour = "black", fill = "darkorange1", size = 2,
           label.padding = grid::unit(0.5, "mm"), 
           label.r = unit(0.5, "mm"), family = "Arial",
           label.size = grid::unit(0.125, "mm") ) +
  annotate(geom = "label", x = -5.0, y = 0.35, label = "KO+Suz12L", 
           colour = "black",  fill = "mediumpurple3", size = 2, 
           label.padding = grid::unit(0.5, "mm"), 
           label.r = unit(0.25, "mm"), family = "Arial",
           label.size = grid::unit(0.125, "mm") ) +
  # points on the right side
  geom_label_repel(data = subset(data_results, Label & KOrS_vs_KOrL_ratio > 0 ), 
                   aes(label = pretty_name, fill = Label_fill), 
                   seed = 16, show.legend = F, segment.curvature = -1e-20, 
                   family = "Arial", size = 2, nudge_x = -0.5,
                   segment.color = 'black',verbose = F, 
                   box.padding = grid::unit(1, "mm"),
                   point.padding = grid::unit(0.55, "mm"),
                   label.padding = grid::unit(0.5, "mm"), 
                   label.size = 0.125, 
                   max.overlaps = 20)  +
  # points on the left side
  geom_label_repel(data = subset(data_results, Label & KOrS_vs_KOrL_ratio < 0), 
                 aes(label = pretty_name, fill = Label_fill), 
                 seed = 16, show.legend = F, segment.curvature = -1e-20, 
                 family = "Arial", size = 2, nudge_x = 0.1,
                 segment.color = 'black',verbose = F, 
                 box.padding = grid::unit(1, "mm"),
                 point.padding = grid::unit(0.55, "mm"),
                 label.padding = grid::unit(0.5, "mm"), 
                 label.size = 0.125, 
                 max.overlaps = 20)  +
  scale_fill_manual(values = c('gray84', 'forestgreen', 'hotpink',  'firebrick2', 'white'))

p_Volcano2

Save to pdf

Code
ggsave(filename = "FigS2L_KOrS_vs_KOrL_Volcano.pdf", plot = p_Volcano2,
       device = cairo_pdf, path = pdf_dir_fig2, width = 4.5, height = 5.2, units = "cm")

Make it interactive

Code
p_interactive_Volcano <- ggplot(data = data_results) +
  aes(x = KOrS_vs_KOrL_ratio, y = -log10(KOrS_vs_KOrL_p.val), 
      fill = KOrS_vs_KOrL_significant, group = 1 ) +
  geom_point(shape = 21, show.legend = F, aes(text = paste0(name)) ) +
  labs(x = expression(log[2] ~ "Fold Change"), y = expression(-log[10] ~ "P-Value")) +
  coord_cartesian(xlim = c(-7, 7.0), clip = 'on') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.01), add = c(0.02, 0.04)),
                     n.breaks = 7) +
  scale_x_continuous(n.breaks = 10, expand = expansion(mult = 0, add = 0) ) +
  scale_fill_manual(values = c("TRUE" = 'firebrick2', "FALSE" = 'gray84') ) +
  labs(x = "log2 Fold Change", y = "-log10 P-Value") +
  Volcano_plot_theme 
Code
ggplotly(p_interactive_Volcano, tooltip = "text")

Add Mean PSMs

Code
assay(data_imp_mixed) |>
  as.data.frame() |>
  rownames_to_column("name") |>
  pivot_longer(cols = !c("name"), names_to = "Sample", values_to = "Norm_PSMs") |>
  mutate(Condition = case_when( grepl(x = Sample, pattern = "KOrL") ~ "KO+L",
                                grepl(x = Sample, pattern = "KOrS") ~ "KO+S",
                                grepl(x = Sample, pattern = "KO_") ~ "KO") ) |>
  relocate(Condition, .after = Sample) |>
  mutate(Condition = factor(Condition, levels = c("KO+L", "KO+S", "KO") ) ) |>
  group_by(name) |>
  mutate(Mean_PSM = mean(Norm_PSMs, na.rm = T)) |>
  select(name, Mean_PSM) |>
  unique() |>
  mutate(name = ifelse(test = name == "Aebp2.1", yes = "Embryo Aebp2", no = name)) |>
  mutate(name = ifelse(test = name == "Aebp2", yes = "Somatic Aebp2", no = name)) |>
  # actually simplify and label the embryonic Aebp2 just as "Aepb2"
  mutate(name = ifelse(test = name == "Embryo Aebp2", yes = "Aebp2", no = name)) |>
  mutate(name = ifelse(test = name == "Jard2", yes = "Jarid2", no = name) ) |>
  right_join(data_results, by = "name") |>
  relocate(ID, .after = name) -> res_KO_rescues

Define proteins to label in the MA plot

Code
prot_to_label_MA <- c(PRC2, PRC2.1, PRC2.2, "Stag2", "Phf11", "Rprd2", "Addg",
                      "Chm2a", "Nsun5", "Zn296", "Anxa6", "Ezhip", "Eloc", "H4",
                      "Epop", "Elob", "Dyhc1", "Myo1e",  "Brd2", "Gptch4", 
                      "Smrd2", "Actb", "Myh14", "Cor2b") |> unique()

Plot Mean PSMs vs log2 Fold Change.

Code
res_KO_rescues |>
mutate(Label = ifelse(name %in% prot_to_label_MA, yes = TRUE, no = FALSE),
       pretty_name = case_when(Label == TRUE ~ str_wrap(name, width = 10), Label == FALSE ~ name) ) |>
  mutate(Label_fill = case_when(name == "Aebp2" ~ "forestgreen",
                                name == "Jarid2" ~ "hotpink", 
                                !name %in% c("Aebp2", "Jarid2") ~ "white")) -> res_KO_rescues

ggplot(res_KO_rescues) +
  aes(x = Mean_PSM, y = KOrS_vs_KOrL_ratio, fill = KOrS_vs_KOrL_significant ) +
  geom_point(shape = 21, show.legend = F, stroke = 0.16)  +
  geom_hline(yintercept = 0, colour = "black") +
  labs(x = expression(log[2] ~ "Mean PSM"), y = expression(-log[2] ~ "Fold Change")) +
  scale_y_continuous(n.breaks = 7) +
  scale_x_continuous(n.breaks = 7) -> p_basic_MA_plot

p_basic_MA_plot +
  # points on top
  geom_label_repel(data = subset(res_KO_rescues, Label & KOrS_vs_KOrL_ratio > 0 ), 
                   aes(label = pretty_name, fill = Label_fill), 
                   seed = 16, show.legend = F, segment.curvature = -1e-20, 
                   family = "Arial", size = 2, nudge_x = -0.25,
                   segment.color = 'black',verbose = F, 
                   box.padding = grid::unit(1, "mm"),
                   point.padding = grid::unit(0.55, "mm"),
                   label.padding = grid::unit(0.5, "mm"), 
                   label.size = 0.125, 
                   max.overlaps = 20)  +
  # points on the bottom
  geom_label_repel(data = subset(res_KO_rescues, Label & KOrS_vs_KOrL_ratio < 0), 
                 aes(label = pretty_name, fill = Label_fill), 
                 seed = 16, show.legend = F, segment.curvature = -1e-20, 
                 family = "Arial", size = 2, nudge_x = 0.25,
                 segment.color = 'black',verbose = F, 
                 box.padding = grid::unit(1, "mm"),
                 point.padding = grid::unit(0.55, "mm"),
                 label.padding = grid::unit(0.5, "mm"), 
                 label.size = 0.125, 
                 max.overlaps = 20) +

  scale_fill_manual(values = c('gray84', "forestgreen", "hotpink", 'firebrick2', "white")) +
  Volcano_plot_theme -> p_MA_plot_Rescues
p_MA_plot_Rescues

Save to pdf.

Code
ggsave(filename = "MAplot_KOrS_vs_KOrL.pdf", plot = p_MA_plot_Rescues,
       device = cairo_pdf, path = pdf_dir_fig2, width = 4.5, height = 5.25, units = "cm")
Code
p_interactive_MA <- p_basic_MA_plot +
  geom_point(shape = 21, show.legend = F, aes(text = paste0(name)) ) +
  scale_fill_manual(values = c("TRUE" = 'firebrick2', "FALSE" = 'gray84') ) +
  labs(x = "log2 Mean PSMs", y = "log2 Fold Change") +
  Volcano_plot_theme 

I also made an interactive MA plot if one is interested in exploring more the proteins.

Code
ggplotly(p_interactive_MA)

4.2 Check overlap between SUZ12-IP and FLAG-IP

Check how many proteins are in each IP-MS experiments and how many are shared or unique to each experiment. For this analysis I use the protein “name” as the identified. One can also change the $name, to $UniProtID and use the UniProtIDs as protein identifiers.

Code
message(length(res_WT_Dexon4$name), '\tproteins idenfied with SUZ12 IP-MS')
771 proteins idenfied with SUZ12 IP-MS
Code
message(length(res_KO_rescues$name), '\tproteins idenfied with FLAG IP-MS')
1276    proteins idenfied with FLAG IP-MS

Make a list of all identified proteins for the upset plot

Code
all_identified_proteins <- list(res_WT_Dexon4$name, res_KO_rescues$name)
names(all_identified_proteins) <- c('All SUZ12 IP',  'All FLAG IP')

Make an UpSet plot to show the intersection betweenn all proteins in the SUZ12 and FLAG IP.

Code
prot_intersection <- upset(fromList(all_identified_proteins),
                      text.scale = 1.5, nintersects = NA, 
                      mb.ratio = c(0.65, 0.35), decreasing = c(T, F),
                      order.by = 'freq',
                      main.bar.color = "gray16") 

Show intersection between SUZ12 and FLAG IPs.

Code
prot_intersection

Save this upset plot to pdf.

Code
cairo_pdf(file = file.path(pdf_dir_fig2, 'Rev1_p2_intersection_all_proteins_IPs.pdf'), 
          width = 4, height = 4, bg = NA)
prot_intersection
dev.off()
quartz_off_screen 
                2 

Save this upset plot jpeg.

Code
jpeg(file = file.path(pdf_dir_fig2, 'Rev1_p2_intersection_all_proteins_IPs.jpeg'), 
     width = 4, height = 4, units = "in", res = 150)
prot_intersection
dev.off()
quartz_off_screen 
                2 

Create a dataframe with the shared proteins and unique proteins to each IP. This is later on saved to a csv file.

Code
suz12_flag <- intersect(res_WT_Dexon4$name, res_KO_rescues$name)
suz12_specific <- res_WT_Dexon4$name[!res_WT_Dexon4$name %in% suz12_flag]
flag_specific <- res_KO_rescues$name[!res_KO_rescues$name %in% suz12_flag]

# pad with NAs smaller intersection
if( length(flag_specific) > length(suz12_specific) ) {
  length(suz12_specific) <- length(flag_specific)  
  length(suz12_flag) <- length(flag_specific) 
}

all_prot_intersection <- data.frame(SUZ12_isect_FLAG = suz12_flag,
                                    SUZ12_specific = suz12_specific,
                                    FLAG_specific = flag_specific) 

Set thresholds for identifing “enriched” proteins.

Code
ratio_thrshld <- 1
pval_thshld <- 0.1

Identify “enriched proteins” by having |ratio| >= 1 and P-value =< 0.1 (lose stringency).

Code
wt_enriched <- subset(res_WT_Dexon4, Dexon4_vs_WT_ratio <= -ratio_thrshld &
                          Dexon4_vs_WT_p.val <= pval_thshld )$name

dex4_enriched <- subset(res_WT_Dexon4, Dexon4_vs_WT_ratio >= ratio_thrshld &
                          Dexon4_vs_WT_p.val <= pval_thshld )$name

KOrL_enriched <- subset(res_KO_rescues, KOrS_vs_KOrL_ratio <= -ratio_thrshld &
                          KOrS_vs_KOrL_p.val <= pval_thshld )$name

KOrS_enriched <- subset(res_KO_rescues, KOrS_vs_KOrL_ratio >= ratio_thrshld &
                          KOrS_vs_KOrL_p.val <= pval_thshld )$name

message(length(wt_enriched), '\tproteins enriched in WT vs ∆ex4 (SUZ12 IP-MS)')
31  proteins enriched in WT vs ∆ex4 (SUZ12 IP-MS)
Code
message(length(dex4_enriched), '\tproteins enriched in ∆ex4 vs WT (SUZ12 IP-MS)')
31  proteins enriched in ∆ex4 vs WT (SUZ12 IP-MS)
Code
message(length(KOrL_enriched), '\tproteins enriched in KO+L vs KO+S (FLAG IP-MS)')
376 proteins enriched in KO+L vs KO+S (FLAG IP-MS)
Code
message(length(KOrS_enriched), '\tproteins enriched in KO+S vs KO+L (FLAG IP-MS)')
104 proteins enriched in KO+S vs KO+L (FLAG IP-MS)

Make a list of for the upset plot of enriched proteins

Code
identified_proteins <- list(wt_enriched, dex4_enriched, 
                            KOrL_enriched, KOrS_enriched)
names(identified_proteins) <- c('WT enriched',  '∆ex4 enriched',
                                'KO+L enriched', 'KO+S enriched')

Make an UpSet plot

Code
prot_overlap <- upset(fromList(identified_proteins), nsets = length(identified_proteins),
                      text.scale = 1.5, nintersects = NA, 
                      mb.ratio = c(0.65, 0.35), decreasing = c(T, F),
                      order.by = 'freq',
                      main.bar.color = "gray16") 

Show overlap.

Code
prot_overlap

Save this upset plot to pdf.

Code
cairo_pdf(file = file.path(pdf_dir_fig2, 'Rev1_p2_overlap_enriched_proteins_IPs.pdf'), 
          width = 6, height = 4, bg = NA)
prot_overlap
dev.off()
quartz_off_screen 
                2 

Save this upset plot jpeg.

Code
jpeg(file = file.path(pdf_dir_fig2, 'Rev1_p2_overlap_enriched_proteins_IPs.jpeg'), 
          width = 6, height = 4, units = "in", res = 150)
prot_overlap
dev.off()
quartz_off_screen 
                2 

Check the proteins between each overlap.

Code
wt_koL <- intersect(wt_enriched, KOrL_enriched)
wt_koS <- intersect(wt_enriched, KOrS_enriched)
dex_koL <- intersect(dex4_enriched, KOrL_enriched)
dex_koS <- intersect(dex4_enriched, KOrS_enriched)

# pad with NAs smaller intersection
if( length(wt_koL) > length(dex_koL) ) {
  length(wt_koS) <- length(wt_koL)  
  length(dex_koL) <- length(wt_koL) 
  length(dex_koS) <- length(wt_koL) 
} else if ( length(wt_koL) < length(dex_koL) ) {
  length(wt_koS) <- length(dex_koL)  
  length(wt_koL) <- length(dex_koL) 
  length(dex_koS) <- length(dex_koL) 
} else if (length(wt_koL) == length(dex_koL) ) {
  length(wt_koS) <- length(dex_koL)   
  length(dex_koS) <- length(dex_koL) 
} else {
  stop("I'm having issues with vector lengths")
}

intersections <- data.frame(WT_isect_KOrL = wt_koL, 
                            WT_isect_KOrS = wt_koS,
                            dex4_isect_KOrL = dex_koL, 
                            dex4_isect_KOrS = dex_koS) 

Print summary of enriched proteins overlap.

Code
kable(intersections)
Overlap of significant proteins between SUZ12-IP and FLAG-IP
WT_isect_KOrL WT_isect_KOrS dex4_isect_KOrL dex4_isect_KOrS
Aebp2 Cg050 A0a0r4j0j6 NA
Ash2l Myh10 Anm1 NA
Brx1 Myh9 Cdk13 NA
Cecr2 NA Hnrh2 NA
Chap1 NA Hnrpq NA
Ddx51 NA If2b3 NA
E9qn31 NA Ncoa5 NA
Ebp2 NA Prp6 NA
Esco2 NA Rbm22 NA
Ezh1 NA Syf1 NA
Ezhip NA Tfp11 NA
Ing5 NA NA NA
Q8k205 NA NA NA
Utf1 NA NA NA

4.3 Export results to table

Export for supplementary table the results dataframes

Code
res_WT_Dexon4 <- select(res_WT_Dexon4, -c("Dexon4_centered","WT_centered", "significant") )
colnames(res_WT_Dexon4) <- gsub("Dexon4_vs_WT", "∆ex4vsWT", colnames(res_WT_Dexon4) ) 

write_excel_csv(x = res_WT_Dexon4, file = file.path(tbl_dir_fig2, 'WT_vs_dEX4_res.csv'),
                col_names = T, quote = "all")

res_KO_rescues <- select(res_KO_rescues, c("name","ID", "Mean_PSM", starts_with('KOrS_vs_KOrL')) )
colnames(res_KO_rescues) <- gsub("KOrS_vs_KOrL", "KO+S_vs_KO+L", colnames(res_KO_rescues) ) 

write_excel_csv(x = res_KO_rescues, file = file.path(tbl_dir_fig2, 'KO-L_vs_KO-S_res.csv'),
                col_names = T, quote = "all")

Export the intersection between the 2 IPs.

Code
write_excel_csv2(x = all_prot_intersection, 
                 file = file.path(tbl_dir_fig2, 'SUZ12_FLAG_IPs_all_proteins_intersection.xls'),
                 na = "NA", append = F, col_names = T, delim = ";", quote = "all", eol = "\n")

Export overlap dataframe of enriched proteins.

Code
write_excel_csv2(x = intersections, 
                 file = file.path(tbl_dir_fig2, 'SUZ12_FLAG_IPs_enriched_proteins_overlaps.xls'),
                 na = "NA", append = F, col_names = T, delim = ";", quote = "all", eol = "\n")

4.4 Targeted proteomic identification of SUZ12-S peptide

Show the MS/MS spectra of the SUZ12 Short peptides in WT and ∆ex4 mESCs.

Tidy and reshape the PRM data imported before.

Code
PRM |>
  subset(Replicate %in% c(2, 4)) |>
  subset(Peptide == skipping_peptide) |> 
  subset(Precursor_Charge == 3)  |>
  subset(grepl("^y|b", x = Fragment_Ion)) |>
  subset(Fragment_Ion != "y14") |>
  mutate(Peak_Rank = Peak_Rank - 1) |>
  mutate(Ion_Type = ifelse(test = grepl(pattern = "^b", x = Fragment_Ion), yes = "N-term", no = "C-term") ) |>
  mutate(Sample_Type = ifelse(grepl(Replicate, pattern =  2), yes = "WT", no = "∆ex4")) |>
  mutate(Sample_Type = factor(Sample_Type, levels = c("WT", "∆ex4"))) -> tidy_PRM

Glimpse into the targeted proteomics data

Code
kable( slice_head(tidy_PRM) )
General parameters of SUZ12-S peptide in targeted proteomics experiment
Peptide Protein Replicate Precursor_Mz Precursor_Charge Product_Mz Product_Charge Fragment_Ion Area Background Peak_Rank Retention_Time Ion_Type Sample_Type
SLSAHLQLTFTGFFHK E9PW15 2 611.9931 3 984.4938 1 y8 32633 0 5 59.23 C-term WT

Plot supplementary figure 1E.

Code
ggplot(tidy_PRM) +
  aes(x = Product_Mz, y = Area /10^3, colour = Ion_Type, group = Ion_Type) +
  facet_wrap(~Sample_Type, scales = "fixed") +
  geom_col(show.legend = F, linewidth = 0.2) +
  geom_text(data = subset(tidy_PRM,  Ion_Type == "N-term"),
                  aes(label = Fragment_Ion), 
                  family = "Arial", size = 1.25, nudge_y = 7, nudge_x = -20 )+
  
  geom_text(data = subset(tidy_PRM,  Ion_Type == "C-term"),
                  aes(label = Fragment_Ion),
                  family = "Arial", size = 1.25, nudge_y = 7, nudge_x = 20) + #
  scale_x_continuous(n.breaks = 4) +
  scale_y_continuous(expand = expansion(add = 0, mult = c(0, 0.1)) ) + 
  scale_colour_manual(values = c("#C98686", "#966B9D"), name = "Ion Type") +
  coord_cartesian(xlim = c(400, NA)) +
  labs(x = "m/z", title = paste0("Peptide: ", unique(tidy_PRM$Peptide))) +
  theme_classic(base_size = 6, base_family = "Arial") +
  theme(panel.grid.major = element_line(linewidth = 0.15, colour = "gray84"),
        axis.text = element_text(colour = "black"),
        axis.title = element_text(size = 5),
        axis.line = element_line(linewidth = 0.1),
        plot.title = element_text(size = 5, hjust = 0.5, 
                                  margin = margin(b = -1, unit = "mm")),
        axis.ticks = element_line(linewidth = 0.1),
        legend.position = c(0.125, 0.9),
        legend.box.background = element_blank(),
        legend.key = element_blank(),
        legend.key.size = unit(0, "mm"),
        legend.title = element_blank(),
        legend.text = element_text(margin = margin(l = -1, unit = "mm") ),
        legend.text.align = 0,
        strip.background = element_blank()) -> p_spectra
p_spectra

Save to pdf.

Code
ggsave(filename = "FigS2E_SUZ12_Short_Peptide_Spectra.pdf", plot = p_spectra,
       device = cairo_pdf, path = pdf_dir_fig2, width = 4.5, height = 2.8, units = "cm")

4.5 Peptide coverage

4.5.1 MzTab PSMs

First use the PSMs from the MzTab file for SUZ12 and Jarid2

Code
Suz12_UniProt_ID <- 'Q80U70'
Jarid2_UniProt_ID <-'Q62315'
list_of_UniProtIDs <- c(Suz12_UniProt_ID, Jarid2_UniProt_ID)
lapply(list_of_UniProtIDs, function(x){
  # message('Parsing UniProtID: ', x)
  read_psm_MzTab(MzTab_path = WT_Dex4_mzTAB_path, 
                 POI_UniProtID = x, parse_accession = T)
}) -> list_df
prot_pep_coverage <- do.call('rbind', list_df ) 

Plot the 2 proteins

Code
ggplot(prot_pep_coverage)  + 
  facet_wrap( ~ Protein, nrow = 2, scales = "free")  +
  aes(x = start, xend = end, y = UniProtID, yend = UniProtID, color = UniProtID) +
  geom_segment(size = 1.5, show.legend = F) +
  labs(x = "Amino acids position", y = "") +
  scale_x_continuous(expand = expansion(add = 0.1), 
                     n.breaks = 10, limits = c(0, NA)) +
  pep_coverage_theme
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

4.5.2 Saint peptide counts

The MzTab file from proteomicslfq doesn’t contain the peptide info per sample, so I use the Saint software (Teo et al. 2014) output.

I created a simple R function to read the peptides of any given UniProtID from the excel output tables using read_saint_peps(). This peptides are then mapped to the protein sequence where I extract start and end position and prepare them for plotting with the function cov2res_peptides(). This functions allow to specify the exon boundary amino acids (with AA_exon_start and AA_exon_end), in this way the function summary_exon_coverage() prints out the number of peptides that map to that exonic region. Lastly the peptides are plotted along the protein sequence using plot_pep_cov().

4.5.2.1 WT and ∆ex4 IP

Import SUZ12 peptides

Code
suz12_pep_cov <- read_saint_peps(saint_path = saint_path_WT_dEx4, 
                                 UniProtID = Suz12_UniProt_ID, 
                                 AA_exon_start = 129, AA_exon_end = 153,
                                 eoi_name = 'exon4')
Please wait we are processing your accessions ...
80 peptides could map
79 actually map precisely
Code
# jarid2_pep_cov <- read_saint_peps(saint_path = saint_path_WT_dEx4, 
#                                  UniProtID = Jarid2_UniProt_ID, 
#                                  AA_exon_start = 1, AA_exon_end = 2,
#                                  eoi_name = 'exon4')

Check peptide counts in SUZ12 exon 4

Code
kable(summary_exon_coverage(suz12_pep_cov))
Exon 4 peptide coverage
Region Pepetide_Spectrum_Matches
exon4 40
other_exons 454
total_protein 494

Reshape coverage data and rename sample names and turn them into a factor so that they are in the right order for plotting them.

Code
suz12_res <- suz12_pep_cov |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "WTp",
                                 Sample == "B" ~ "WT#1",
                                 Sample == "C" ~ "WT#2",
                                 Sample == "D" ~ "∆EX4#1",
                                 Sample == "E" ~ "∆EX4#2",
                                 Sample == "F" ~ "∆EX4#3"),
         .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, 
                              levels = c("WTp", "WT#1", "WT#2", 
                                         "∆EX4#1", "∆EX4#2", "∆EX4#3")))

Plot SUZ12 coverage. Each peptide is coloured by their area under the peak intensity.

Code
p_suz12_cov <- plot_pep_cov(df_res = suz12_res, AA_exon_start = 129, AA_exon_end = 152)
p_suz12_cov

Save supplementary figure 2F to pdf.

Code
ggsave(filename = "FigS2F_WT_dex4_IP_Peptides_Coverage.pdf", plot = p_suz12_cov,
       device = cairo_pdf, path = pdf_dir_fig2, width = 16.5, height = 3.5, units = "cm")

To address reviewer #3 comment #2 I check the peptides coverage for in 2 other PRC2 PanAS exons in EZH2 (UniProtID: Q61188).

Code
read_saint_peps(saint_path = saint_path_WT_dEx4, 
                UniProtID = 'Q61188', 
                AA_exon_start = 83, AA_exon_end = 121,
                eoi_name = 'exon 4') |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "WTp", Sample == "B" ~ "WT#1",
                                 Sample == "C" ~ "WT#2", Sample == "D" ~ "∆EX4#1",
                                 Sample == "E" ~ "∆EX4#2", Sample == "F" ~ "∆EX4#3"), .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, 
                              levels = c("WTp", "WT#1", "WT#2", 
                                         "∆EX4#1", "∆EX4#2", "∆EX4#3"))) -> df_ezh2_ex4_wt
Please wait we are processing your accessions ...
38 peptides could map
37 actually map precisely

Plot EZH2 exon 4 coverage

Code
plot_pep_cov(df_res = df_ezh2_ex4_wt, AA_exon_start = 83, AA_exon_end = 121) +
  labs(title = 'EZH2 exon 4 (SUZ12-IP)') -> p_EZH2_ex4_cov
p_EZH2_ex4_cov

Only part of the exon is covered but we can’t claim major differences in peptides abunance.

Save to pdf.

Code
ggsave(filename = "Rev3_EZH2ex4_WT_Dex4_IP_Peptides_Coverage.pdf", 
       plot = p_EZH2_ex4_cov, device = cairo_pdf, path = pdf_dir_fig2, 
       width = 16.5, height = 3.5, units = "cm")
Code
read_saint_peps(saint_path = saint_path_WT_dEx4, 
                UniProtID = 'Q61188', 
                AA_exon_start = 516, AA_exon_end = 558,
                eoi_name = 'exon 14') |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "WTp", Sample == "B" ~ "WT#1",
                                 Sample == "C" ~ "WT#2", Sample == "D" ~ "∆EX4#1",
                                 Sample == "E" ~ "∆EX4#2", Sample == "F" ~ "∆EX4#3"), .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, levels = c("WTp", "WT#1", "WT#2", "∆EX4#1", "∆EX4#2", "∆EX4#3"))
         ) -> df_ezh2_ex14_wt
Please wait we are processing your accessions ...
38 peptides could map
37 actually map precisely

Plot EZH2 exon 14 from the same dataset.

Code
plot_pep_cov(df_res = df_ezh2_ex14_wt, AA_exon_start = 516, AA_exon_end = 558) +
  labs(title = 'EZH2 exon 14 (SUZ12-IP)') -> p_EZH2_ex14_cov
p_EZH2_ex14_cov

Save to pdf

Code
ggsave(filename = "Rev3_EZH2ex14_WT_Dex4_IP_Peptides_Coverage.pdf", 
       plot = p_EZH2_ex14_cov, device = cairo_pdf, path = pdf_dir_fig2, 
       width = 16.5, height = 3.5, units = "cm")

Do the same kind of plots for the FLAG IP of the KO rescues.

4.5.2.2 KO rescues flag-IP

Import the data using the same ad hoc function.

Code
suz12_pep_cov <- read_saint_peps(saint_path = saint_path_KO_Rescues, 
                                 UniProtID = Suz12_UniProt_ID, 
                                 AA_exon_start = 129, AA_exon_end = 153,
                                 eoi_name = 'exon4')
Please wait we are processing your accessions ...
95 peptides could map
94 actually map precisely

Reshape coverage data and rename sample names.

Code
suz12_res <- suz12_pep_cov |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "KO#1",
                                 Sample == "B" ~ "KO#2",
                                 Sample == "C" ~ "KO#3",
                                 Sample == "D" ~ "KO+L#1",
                                 Sample == "E" ~ "KO+L#2",
                                 Sample == "F" ~ "KO+L#3",
                                 Sample == "G" ~ "KO+S#1",
                                 Sample == "H" ~ "KO+S#2",
                                 Sample == "I" ~ "KO+S#3",
                                 ),
         .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, 
                              levels = c("KO+L#1", "KO+L#2", "KO+L#3",
                                         "KO+S#1", "KO+S#2", "KO+S#3",
                                         "KO#1", "KO#2", "KO#3")))

Plot SUZ12 coverage in rescues IP

Code
p_suz12_cov_rescues <- plot_pep_cov(df_res = suz12_res, 
                                    AA_exon_start = 129, AA_exon_end = 152)
p_suz12_cov_rescues

Note

Please note that there’s no peptide mapping to SUZ12 for sample KO#1 and KO#2 and in the case of KO#3 the peptides are only in exon 1 which is before the SUZ12 exon 2-3 deletion.

Save KO rescues peptides coverage to pdf.

Code
ggsave(filename = "FigS2F_KO_Rescues_IP_Peptides_Coverage.pdf", plot = p_suz12_cov_rescues,
       device = cairo_pdf, path = pdf_dir_fig2, width = 16.5, height = 3.5, units = "cm")

Check EZH2 peptides coverage also in this dataset like before.

Code
read_saint_peps(saint_path = saint_path_KO_Rescues, 
                UniProtID = 'Q61188', 
                AA_exon_start = 83, AA_exon_end = 121,
                eoi_name = 'exon4') |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "KO#1",
                                 Sample == "B" ~ "KO#2",
                                 Sample == "C" ~ "KO#3",
                                 Sample == "D" ~ "KO+L#1",
                                 Sample == "E" ~ "KO+L#2",
                                 Sample == "F" ~ "KO+L#3",
                                 Sample == "G" ~ "KO+S#1",
                                 Sample == "H" ~ "KO+S#2",
                                 Sample == "I" ~ "KO+S#3",
                                 ),
         .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, 
                              levels = c("KO+L#1", "KO+L#2", "KO+L#3",
                                         "KO+S#1", "KO+S#2", "KO+S#3",
                                         "KO#1", "KO#2", "KO#3"))
         ) -> df_ezh2_ex4_flag
Please wait we are processing your accessions ...
46 peptides could map
46 actually map precisely

Plot coverage

Code
plot_pep_cov(df_res = df_ezh2_ex4_flag, AA_exon_start = 83, AA_exon_end = 121) +
  labs(title = 'EZH2 exon 4 (FLAG-IP)') -> p_EZH2_ex4_cov_flag
p_EZH2_ex4_cov_flag

Not peptides could be mapped to exon 4 of EZH2 as it can also be checked with:

Code
summary_exon_coverage(df_ezh2_ex4_flag)
# A tibble: 2 × 2
  Region        Pepetide_Spectrum_Matches
  <chr>                             <dbl>
1 other_exons                         320
2 total_protein                       320

Save to pdf.

Code
ggsave(filename = "Rev3_EZH2ex4_KO_Rescues_IP_Peptides_Coverage.pdf", 
       plot = p_EZH2_ex4_cov_flag, device = cairo_pdf, path = pdf_dir_fig2, 
       width = 16.5, height = 3.5, units = "cm")
Code
read_saint_peps(saint_path = saint_path_KO_Rescues, 
                UniProtID = 'Q61188', 
                AA_exon_start = 516, AA_exon_end = 558,
                eoi_name = 'exon4') |>
  cov2res_peptides() |>
  mutate(Sample_Name = case_when(Sample == "A" ~ "KO#1",
                                 Sample == "B" ~ "KO#2",
                                 Sample == "C" ~ "KO#3",
                                 Sample == "D" ~ "KO+L#1",
                                 Sample == "E" ~ "KO+L#2",
                                 Sample == "F" ~ "KO+L#3",
                                 Sample == "G" ~ "KO+S#1",
                                 Sample == "H" ~ "KO+S#2",
                                 Sample == "I" ~ "KO+S#3",
                                 ),
         .before = Sample) |>
  mutate(Sample_Name = factor(Sample_Name, 
                              levels = c("KO+L#1", "KO+L#2", "KO+L#3",
                                         "KO+S#1", "KO+S#2", "KO+S#3",
                                         "KO#1", "KO#2", "KO#3"))
         ) -> df_ezh2_ex14_flag

Plot.

Code
plot_pep_cov(df_res = df_ezh2_ex14_flag, AA_exon_start = 516, AA_exon_end = 558) +
  labs(title = 'EZH2 exon 14 (FLAG-IP)') -> p_EZH2_ex14_cov_flag
p_EZH2_ex14_cov_flag

Save to pdf

Code
ggsave(filename = "Rev3_EZH2ex14_KO_Rescues_IP_Peptides_Coverage.pdf", 
       plot = p_EZH2_ex14_cov_flag, device = cairo_pdf, path = pdf_dir_fig2, 
       width = 16.5, height = 3.5, units = "cm")
Comments on the identification of AS isoforms peptides with Mass spectrometry

Due to MS limitations it is very difficult if not impossible to detect different AS from a label free IP-MS experiment. In the case of SUZ12-S peptide we performed a targeted MS protocol. A Precise Reaction Monitoring (PRM) method was used to acquire the data where we selected only 3 specific SUZ12 peptides for further fragmentation.

Code
all_ezh2_cov <- wrap_plots(p_EZH2_ex4_cov, p_EZH2_ex14_cov, 
                           p_EZH2_ex4_cov_flag, p_EZH2_ex14_cov_flag, 
                           ncol = 1, nrow = 4)
Code
all_ezh2_cov

Save to jpeg

Code
ggsave(filename = "Rev3_EZH2ex4-14_SUZ12_FLAG_IP_Peptides_Coverage.jpeg", 
       plot = all_ezh2_cov, device = 'jpeg', path = pdf_dir_fig2, 
       width = 16.5, height = 14, units = "cm", dpi = 300)

I could check other AS exons in PRC2 but for now I want to stop here.

4.6 qPCR

Analyse and plot Suz12 gene expression by qPCR in ESCs.

Code
qPCR <- read_delim(file = file.path(tbl_dir_fig2, "ESC_qPCR_expression_Ct_values.txt"),
                   delim = '\t', col_names = T, show_col_types = FALSE)
Code
qPCR$sample_name <- factor(qPCR$sample_name, levels = c("WT#1", "WT#2", "∆ex4#1", "∆ex4#2", "∆ex4#3"))
qPCR$target_name <- factor(qPCR$target_name, levels = c("Suz12", "Ezh2", "Nanog", "Tbp"))
qPCR$genotype <- factor(qPCR$genotype, levels = c('WT-like', '∆ex4') )
qPCR$C_T <- as.numeric(qPCR$C_T)

Get summary statistics

Code
df_reps_sum <- qPCR |>
  group_by(sample_name, target_name) |>
  summarize(Ct_mean = mean(C_T, na.rm = TRUE), 
            Ct_StDev = sd(C_T, na.rm = TRUE) ) |>
  ungroup()

Add replicates information.

Code
qPCR <- left_join(qPCR, df_reps_sum, by = c("sample_name", "target_name"))

Use Tbp as house keeping gene.

Code
house_kpng_gene <- subset(qPCR, target_name == "Tbp") %>%
  group_by(sample_name, target_name, plate) %>%
  dplyr::summarize(CT_mean_norm = mean(C_T, na.rm = TRUE),
                   Ct_StDev_norm = sd(C_T, na.rm = TRUE)) %>%
  ungroup()
house_kpng_gene$target_name <- NULL

Calculate ∆Ct relative to Tbp house keeping gene.

Code
qPCR <- left_join(qPCR, house_kpng_gene, by = c("sample_name", "plate")) %>%
  mutate(Delta_Ct = C_T - CT_mean_norm,
         Delta_Ct_st_dev = sqrt( (Ct_StDev^2) + (Ct_StDev_norm^2) ) )

Calculate ∆∆Ct relative to WT#1 as a reference sample.

Code
ref_Delta_Ct_WT <- subset(qPCR, sample_name == "WT#1") |>
  select(target_name, replicate, plate, Delta_Ct) |>
  group_by(target_name, plate) |>
  summarize(ref_Delta_Ct = mean(Delta_Ct, na.rm = TRUE) )

Add info back to qPCR dataframe.

Code
qPCR <- left_join(qPCR, ref_Delta_Ct_WT, by = c("target_name", "plate") ) |>
  mutate(Delta_Delta_Ct = Delta_Ct - ref_Delta_Ct)

Calculate fold change as \(2^{-∆∆Ct}\).

Code
qPCR <- group_by(qPCR, sample_name, target_name, plate) %>%
  summarize(Delta_Delta_Ct_mean = mean(Delta_Delta_Ct, na.rm = T),
            Delta_Delta_Ct_std = sqrt(var(Delta_Delta_Ct, na.rm = T))) |>
  ungroup() |>
  left_join(x = qPCR, by = c("sample_name", "target_name", "plate") ) |>
  mutate(Comp_Meth = 2^-(Delta_Delta_Ct_mean),
         Comp_Meth_SD.plu = 2^ - ( Delta_Delta_Ct_mean - Delta_Ct_st_dev),
         Comp_Meth_SD.min = 2^ - ( Delta_Delta_Ct_mean + Delta_Ct_st_dev) )

Plot qPCR fold change for pluripotency genes.

Code
subset(qPCR, target_name != "Tbp") |>
ggplot() +
  aes(x = sample_name,
      y = 2^-Delta_Delta_Ct_mean,
      ymin = Comp_Meth_SD.min,
      ymax = Comp_Meth_SD.plu,
      fill = genotype)  +
  geom_col(position = position_dodge(), width = 0.5, linewidth = 0.2, 
           colour = "black", show.legend = F)   +
  geom_errorbar(stat = "identity", width = 0.25, col = "grey16", linewidth = 0.2,
                position = position_dodge(width = 0.55)) +
  geom_point(inherit.aes = F, show.legend = F,
             aes(x = sample_name, y = 2^-Delta_Delta_Ct,
                 fill = target_name, group = replicate),
             shape = 21, col = "grey16", size = 1, alpha = 1, stroke = 0.2,
             position = position_dodge(width = 0.35) ) +
  facet_grid(~ target_name, scales = "fixed") +
  scale_fill_manual(name = "Cell line name",
                    values = c('WT-like' = "#377AA3", "∆ex4" = '#F7CB48') ) +
  scale_y_continuous(expand = expansion(add = c(0.01, 0.25), mult = 0),
                     labels = scales::number_format(accuracy = 0.1),
                     n.breaks = 5) +
  labs(x = "Sample",
       y = "2^-∆∆Ct\n(normalized to Tbp and WT)") +
  theme_classic(base_size = 6, base_family = "Arial") +
  theme(legend.position = "none",
        axis.text.y = element_text(colour = "black"),
        axis.text.x = element_text(colour = "black", hjust = 0.75, angle = 45),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 5),
        axis.ticks.length = unit(1, units = 'mm'),
        axis.ticks.y = element_line(linewidth = 0.2),
        axis.ticks.x = element_blank(),
        axis.line = element_line(linewidth = 0.2),
        strip.background = element_blank(),
        plot.background = element_blank(),
        panel.background = element_blank(),
        panel.grid.major.y = element_line(colour = 'gray84', linewidth = 0.1) ) -> p_qPCR
p_qPCR

Save to pdf.

Code
ggsave(filename = "FigS2D_qPCR_barplot.pdf", plot = p_qPCR, device = cairo_pdf, 
       path = pdf_dir_fig2, units = 'cm', width = 6.75, height = 3.5)

End of figure 2 and supplementary figure 2 analysis code.

5 Session Info

Code
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Catalina 10.15.7
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Madrid
 date     2023-09-08
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version   date (UTC) lib source
 abind                  1.4-5     2016-07-21 [1] CRAN (R 4.1.0)
 ade4                   1.7-22    2023-02-06 [1] CRAN (R 4.1.2)
 affy                   1.72.0    2021-10-26 [1] Bioconductor
 affyio                 1.64.0    2021-10-26 [1] Bioconductor
 airr                   1.4.1     2022-08-28 [1] CRAN (R 4.1.2)
 alakazam               1.2.1     2022-09-20 [1] CRAN (R 4.1.2)
 ape                    5.7-1     2023-03-13 [1] CRAN (R 4.1.2)
 assertthat             0.2.1     2019-03-21 [1] CRAN (R 4.1.0)
 backports              1.4.1     2021-12-13 [1] CRAN (R 4.1.0)
 Biobase              * 2.54.0    2021-10-26 [1] Bioconductor
 BiocGenerics         * 0.40.0    2021-10-26 [1] Bioconductor
 BiocManager            1.30.22   2023-08-08 [1] CRAN (R 4.1.2)
 BiocParallel           1.28.3    2021-12-09 [1] Bioconductor
 Biostrings           * 2.62.0    2021-10-26 [1] Bioconductor
 bit                    4.0.5     2022-11-15 [1] CRAN (R 4.1.2)
 bit64                  4.0.5     2020-08-30 [1] CRAN (R 4.1.0)
 bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.1.0)
 broom                  1.0.5     2023-06-09 [1] CRAN (R 4.1.2)
 car                    3.1-2     2023-03-30 [1] CRAN (R 4.1.2)
 carData                3.0-5     2022-01-06 [1] CRAN (R 4.1.2)
 cellranger             1.1.0     2016-07-27 [1] CRAN (R 4.1.0)
 circlize               0.4.15    2022-05-10 [1] CRAN (R 4.1.2)
 cli                    3.6.1     2023-03-23 [1] CRAN (R 4.1.2)
 clue                   0.3-64    2023-01-31 [1] CRAN (R 4.1.2)
 cluster                2.1.4     2022-08-22 [1] CRAN (R 4.1.2)
 codetools              0.2-19    2023-02-01 [1] CRAN (R 4.1.2)
 colorspace             2.1-0     2023-01-23 [1] CRAN (R 4.1.2)
 ComplexHeatmap         2.10.0    2021-10-26 [1] Bioconductor
 crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.1.2)
 crosstalk              1.2.0     2021-11-04 [1] CRAN (R 4.1.0)
 curl                   5.0.2     2023-08-14 [1] CRAN (R 4.1.2)
 data.table             1.14.8    2023-02-17 [1] CRAN (R 4.1.2)
 data.tree              1.0.0     2020-08-03 [1] CRAN (R 4.1.0)
 DBI                    1.1.3     2022-06-18 [1] CRAN (R 4.1.2)
 DelayedArray           0.20.0    2021-10-26 [1] Bioconductor
 DEP                  * 1.16.0    2021-10-26 [1] Bioconductor
 digest                 0.6.33    2023-07-07 [1] CRAN (R 4.1.2)
 doParallel             1.0.17    2022-02-07 [1] CRAN (R 4.1.2)
 dplyr                * 1.1.2     2023-04-20 [1] CRAN (R 4.1.2)
 DT                     0.29      2023-08-29 [1] CRAN (R 4.1.2)
 ellipsis               0.3.2     2021-04-29 [1] CRAN (R 4.1.0)
 evaluate               0.21      2023-05-05 [1] CRAN (R 4.1.2)
 fansi                  1.0.4     2023-01-22 [1] CRAN (R 4.1.2)
 farver                 2.1.1     2022-07-06 [1] CRAN (R 4.1.2)
 fastmap                1.1.1     2023-02-24 [1] CRAN (R 4.1.2)
 fdrtool                1.2.17    2021-11-13 [1] CRAN (R 4.1.0)
 foreach                1.5.2     2022-02-02 [1] CRAN (R 4.1.2)
 generics               0.1.3     2022-07-05 [1] CRAN (R 4.1.2)
 GenomeInfoDb         * 1.30.1    2022-01-30 [1] Bioconductor
 GenomeInfoDbData       1.2.7     2023-08-29 [1] Bioconductor
 GenomicAlignments      1.30.0    2021-10-26 [1] Bioconductor
 GenomicRanges        * 1.46.1    2021-11-18 [1] Bioconductor
 GetoptLong             1.0.5     2020-12-15 [1] CRAN (R 4.1.0)
 ggplot2              * 3.4.3     2023-08-14 [1] CRAN (R 4.1.2)
 ggpubr                 0.6.0     2023-02-10 [1] CRAN (R 4.1.2)
 ggrepel              * 0.9.3     2023-02-03 [1] CRAN (R 4.1.2)
 ggsci                  3.0.0     2023-03-08 [1] CRAN (R 4.1.2)
 ggsignif             * 0.6.4     2022-10-13 [1] CRAN (R 4.1.2)
 GlobalOptions          0.1.2     2020-06-10 [1] CRAN (R 4.1.0)
 glue                   1.6.2     2022-02-24 [1] CRAN (R 4.1.2)
 gmm                    1.8       2023-06-06 [1] CRAN (R 4.1.2)
 gprofiler2             0.2.2     2023-06-14 [1] CRAN (R 4.1.2)
 gridExtra              2.3       2017-09-09 [1] CRAN (R 4.1.0)
 gtable                 0.3.4     2023-08-21 [1] CRAN (R 4.1.2)
 hexbin                 1.28.3    2023-03-21 [1] CRAN (R 4.1.2)
 hms                    1.1.3     2023-03-21 [1] CRAN (R 4.1.2)
 htmltools              0.5.6     2023-08-10 [1] CRAN (R 4.1.2)
 htmlwidgets            1.6.2     2023-03-17 [1] CRAN (R 4.1.2)
 httpuv                 1.6.11    2023-05-11 [1] CRAN (R 4.1.2)
 httr                   1.4.7     2023-08-15 [1] CRAN (R 4.1.2)
 igraph                 1.5.1     2023-08-10 [1] CRAN (R 4.1.2)
 impute                 1.68.0    2021-10-26 [1] Bioconductor
 imputeLCMD             2.1       2022-06-10 [1] CRAN (R 4.1.2)
 IRanges              * 2.28.0    2021-10-26 [1] Bioconductor
 iterators              1.0.14    2022-02-05 [1] CRAN (R 4.1.2)
 jsonlite               1.8.7     2023-06-29 [1] CRAN (R 4.1.2)
 knitr                * 1.43      2023-05-25 [1] CRAN (R 4.1.2)
 labeling               0.4.2     2020-10-20 [1] CRAN (R 4.1.0)
 later                  1.3.1     2023-05-02 [1] CRAN (R 4.1.2)
 lattice                0.21-8    2023-04-05 [1] CRAN (R 4.1.2)
 lazyeval               0.2.2     2019-03-15 [1] CRAN (R 4.1.0)
 lifecycle              1.0.3     2022-10-07 [1] CRAN (R 4.1.2)
 limma                  3.50.3    2022-04-07 [1] Bioconductor
 magick                 2.7.5     2023-08-07 [1] CRAN (R 4.1.2)
 magrittr               2.0.3     2022-03-30 [1] CRAN (R 4.1.2)
 MALDIquant             1.22.1    2023-03-20 [1] CRAN (R 4.1.2)
 MASS                   7.3-60    2023-05-04 [1] CRAN (R 4.1.2)
 Matrix                 1.6-1     2023-08-14 [1] CRAN (R 4.1.2)
 MatrixGenerics       * 1.6.0     2021-10-26 [1] Bioconductor
 matrixStats          * 1.0.0     2023-06-02 [1] CRAN (R 4.1.2)
 mime                   0.12      2021-09-28 [1] CRAN (R 4.1.0)
 MsCoreUtils            1.6.2     2022-02-24 [1] Bioconductor
 MSnbase              * 2.20.4    2022-01-16 [1] Bioconductor
 munsell                0.5.0     2018-06-12 [1] CRAN (R 4.1.0)
 mvtnorm                1.2-3     2023-08-25 [1] CRAN (R 4.1.2)
 mzID                   1.32.0    2021-10-26 [1] Bioconductor
 mzR                  * 2.28.0    2021-10-27 [1] Bioconductor
 ncdf4                  1.21      2023-01-07 [1] CRAN (R 4.1.2)
 networkD3              0.4       2017-03-18 [1] CRAN (R 4.1.0)
 nlme                   3.1-163   2023-08-09 [1] CRAN (R 4.1.2)
 norm                   1.0-11.1  2023-06-18 [1] CRAN (R 4.1.2)
 patchwork            * 1.1.3     2023-08-14 [1] CRAN (R 4.1.2)
 pcaMethods             1.86.0    2021-10-26 [1] Bioconductor
 pillar                 1.9.0     2023-03-22 [1] CRAN (R 4.1.2)
 pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.1.0)
 plotly               * 4.10.2    2023-06-03 [1] CRAN (R 4.1.2)
 plyr                   1.8.8     2022-11-11 [1] CRAN (R 4.1.2)
 png                    0.1-8     2022-11-29 [1] CRAN (R 4.1.2)
 preprocessCore         1.56.0    2021-10-26 [1] Bioconductor
 prettyunits            1.1.1     2020-01-24 [1] CRAN (R 4.1.0)
 progress               1.2.2     2019-05-16 [1] CRAN (R 4.1.0)
 promises               1.2.1     2023-08-10 [1] CRAN (R 4.1.2)
 ProtGenerics         * 1.26.0    2021-10-26 [1] Bioconductor
 purrr                  1.0.2     2023-08-10 [1] CRAN (R 4.1.2)
 qdapRegex              0.7.5     2022-05-02 [1] CRAN (R 4.1.2)
 R6                     2.5.1     2021-08-19 [1] CRAN (R 4.1.0)
 ragg                   1.2.5     2023-01-12 [1] CRAN (R 4.1.2)
 RColorBrewer           1.1-3     2022-04-03 [1] CRAN (R 4.1.2)
 Rcpp                 * 1.0.11    2023-07-06 [1] CRAN (R 4.1.2)
 RCurl                  1.98-1.12 2023-03-27 [1] CRAN (R 4.1.2)
 readr                * 2.1.4     2023-02-10 [1] CRAN (R 4.1.2)
 readxl               * 1.4.3     2023-07-06 [1] CRAN (R 4.1.2)
 rjson                  0.2.21    2022-01-09 [1] CRAN (R 4.1.2)
 rlang                  1.1.1     2023-04-28 [1] CRAN (R 4.1.2)
 rmarkdown              2.24      2023-08-14 [1] CRAN (R 4.1.2)
 Rsamtools              2.10.0    2021-10-26 [1] Bioconductor
 rstatix                0.7.2     2023-02-01 [1] CRAN (R 4.1.2)
 rstudioapi             0.15.0    2023-07-07 [1] CRAN (R 4.1.2)
 S4Vectors            * 0.32.4    2022-03-29 [1] Bioconductor
 sandwich               3.0-2     2022-06-15 [1] CRAN (R 4.1.2)
 scales                 1.2.1     2022-08-20 [1] CRAN (R 4.1.2)
 seqinr                 4.2-30    2023-04-05 [1] CRAN (R 4.1.2)
 sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.1.0)
 shape                  1.4.6     2021-05-19 [1] CRAN (R 4.1.0)
 shiny                  1.7.5     2023-08-12 [1] CRAN (R 4.1.2)
 shinydashboard         0.7.2     2021-09-30 [1] CRAN (R 4.1.0)
 stringi                1.7.12    2023-01-11 [1] CRAN (R 4.1.2)
 stringr              * 1.5.0     2022-12-02 [1] CRAN (R 4.1.2)
 SummarizedExperiment * 1.24.0    2021-10-26 [1] Bioconductor
 systemfonts            1.0.4     2022-02-11 [1] CRAN (R 4.1.2)
 textshaping            0.3.6     2021-10-13 [1] CRAN (R 4.1.0)
 tibble               * 3.2.1     2023-03-20 [1] CRAN (R 4.1.2)
 tidyr                * 1.3.0     2023-01-24 [1] CRAN (R 4.1.2)
 tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.1.2)
 tidyverse              2.0.0     2023-02-22 [1] CRAN (R 4.1.2)
 tmvtnorm               1.5       2022-03-22 [1] CRAN (R 4.1.2)
 tzdb                   0.4.0     2023-05-12 [1] CRAN (R 4.1.2)
 UniprotR             * 2.2.2     2022-10-04 [1] CRAN (R 4.1.2)
 UpSetR               * 1.4.0     2019-05-22 [1] CRAN (R 4.1.0)
 utf8                   1.2.3     2023-01-31 [1] CRAN (R 4.1.2)
 vctrs                  0.6.3     2023-06-14 [1] CRAN (R 4.1.2)
 viridis              * 0.6.4     2023-07-22 [1] CRAN (R 4.1.2)
 viridisLite          * 0.4.2     2023-05-02 [1] CRAN (R 4.1.2)
 vroom                  1.6.3     2023-04-28 [1] CRAN (R 4.1.2)
 vsn                    3.62.0    2021-10-26 [1] Bioconductor
 withr                  2.5.0     2022-03-03 [1] CRAN (R 4.1.2)
 xfun                   0.40      2023-08-09 [1] CRAN (R 4.1.2)
 XML                    3.99-0.14 2023-03-19 [1] CRAN (R 4.1.2)
 xtable                 1.8-4     2019-04-21 [1] CRAN (R 4.1.0)
 XVector              * 0.34.0    2021-10-26 [1] Bioconductor
 yaml                   2.3.7     2023-01-23 [1] CRAN (R 4.1.2)
 zlibbioc               1.40.0    2021-10-26 [1] Bioconductor
 zoo                    1.8-12    2023-04-13 [1] CRAN (R 4.1.2)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

──────────────────────────────────────────────────────────────────────────────

References

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. Nextflow enables reproducible computational workflows. Nature Biotechnology 35 (4): 316–19. https://doi.org/10.1038/nbt.3820.
Eng, J. K., T. A. Jahan, and M. R. Hoopmann. 2013. Comet: an open-source MS/MS sequence database search tool.” Proteomics 13 (1): 22–24.
Huber, Wolfgang, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression.” Bioinformatics 18 (suppl_1): S96–104. https://doi.org/10.1093/bioinformatics/18.suppl_1.S96.
Kim, S., and P. A. Pevzner. 2014. MS-GF+ makes progress towards a universal database search tool for proteomics.” Nat Commun 5 (October): 5277.
MacLean, Brendan, Daniela M Tomazela, Nicholas Shulman, Matthew Chambers, Gregory L Finney, Barbara Frewen, Randall Kern, David L Tabb, Daniel C Liebler, and Michael J. MacCoss. 2010. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments.” Bioinformatics 26 (7): 966–68. https://doi.org/10.1093/bioinformatics/btq054.
Teo, Guoci, Guomin Liu, Jianping Zhang, Alexey I. Nesvizhskii, Anne Claude Gingras, and Hyungwon Choi. 2014. SAINTexpress: Improvements and additional features in Significance Analysis of INTeractome software.” Journal of Proteomics 100 (April): 37–43. https://doi.org/10.1016/j.jprot.2013.10.023.
Zhang, Xiaofei, Arne H. Smits, Gabrielle BA van Tilburg, Huib Ovaa, Wolfgang Huber, and Michiel Vermeulen. 2018. Proteome-wide identification of ubiquitin interactions using UbIA-MS.” Nature Protocols 13 (3): 530–50. https://doi.org/10.1038/nprot.2017.147.